Abstract

We establish a Bernstein-type inequality for a class of stochastic processes that include the classical
geometrically $\phi$-mixing processes, Rio’s generalization of these processes, as well as many
time-discrete dynamical systems. Modulo a logarithmic factor and some constants, our Bernstein- 
type inequality coincides with the classical Bernstein inequality for i.i.d. data. We further use this 
new Bernstein-type inequality to derive an oracle inequality for generic regularized empirical risk 
minimization algorithms and data generated by such processes. Applying this oracle inequality to 
support vector machines using the Gaussian kernels for both least squares and quantile regression, 
it turns out that the resulting learning rates match, up to some arbitrarily small extra term in the 
exponent, the optimal rates for i.i.d. processes. 


1 Introduction 

Concentration inequalities such as Hoeffding’s inequality, Bernstein’s inequality, McDiarmid’s
inequality, and Talagrand’s inequality play an important role in many areas of probability. For example,
the analysis of various methods from non-parametric statistics and machine learning crucially depends on
these inequalities, see e.g. [?, 20, 22, 42]. Here, stronger results can typically be achieved by
Bernstein’s inequality and/or Talagrand’s inequality, since these inequalities allow for localization due to
their specific dependence on the variance. In particular, most derivations of minimax optimal learning
rates are based on one of these inequalities.

The concentration inequalities mentioned above all assume the data to be generated by an i.i.d. process.
Unfortunately, this assumption is often violated in several important areas of application,
including financial prediction, signal processing, system observation and diagnosis, text and speech
recognition, and time series forecasting. For this and other reasons there has been some effort to establish
concentration inequalities for non-i.i.d. processes, too. For example, generalizations of Bernstein’s
inequality to $\alpha$-mixing and $\phi$-mixing processes have been found in [10, 33, 32] and [38], respectively.
Among many other applications, the Bernstein-type inequality established in [10] was used in [50] to obtain
convergence rates for sieve estimates from $\alpha$-mixing strictly stationary processes in the special case of
neural networks. Furthermore, [23] applied the Bernstein-type inequality in [33] to derive an oracle inequality
for generic regularized empirical risk minimization algorithms learning from stationary $\alpha$-mixing
processes. Moreover, by employing the Bernstein-type inequality in [32], [7] derived almost sure uniform
rates of convergence for the estimated Lévy density both in mixed-frequency and low-frequency setups
and proved that these rates are optimal in the minimax sense. Finally, in the particular case of the least
square loss, [?] obtained the optimal learning rate for $\phi$-mixing processes by applying the Bernstein-type
inequality established in [38].

However, there exist many dynamical systems such as the uniformly expanding maps given in [17, p. 41]
that are not $\alpha$-mixing. To deal with such non-mixing processes Rio [34] introduced so-called
$\tilde\phi$-mixing coefficients, which extend the classical $\phi$-mixing coefficients. For dynamical systems with
exponentially decreasing, modified $\tilde\phi$-coefficients, [47] derived a Bernstein-type inequality, which turns out
to be the same as the one for i.i.d. processes modulo some logarithmic factor. However, this modification
seems to be significantly stronger than Rio’s original $\tilde\phi$-mixing, so it remains unclear when the
Bernstein-type inequality in [47] is applicable. In addition, the $\tilde\phi$-mixing concept is still not large enough to cover
many commonly considered dynamical systems. To include such dynamical systems, [31] proposed the
C-mixing coefficients, which further generalize $\tilde\phi$-mixing coefficients.

In this work, we establish a Bernstein-type inequality for geometrically C-mixing processes, which,
modulo a logarithmic factor and some constants, coincides with the classical one for i.i.d. processes.
Using the techniques developed in [23], we then derive an oracle inequality for generic regularized
empirical risk minimization and C-mixing processes. We further apply this oracle inequality to a
state-of-the-art learning method, namely support vector machines (SVMs) with Gaussian kernels. Here it
turns out that for both least squares and quantile regression, we can recover the (essentially) optimal
rates recently found for the i.i.d. case, see [21], when the data is generated by a geometrically C-mixing
process. Finally, we establish an oracle inequality for the problem of forecasting an unknown dynamical
system. This oracle inequality will make it possible to extend the purely asymptotic analysis in [41] to
learning rates.

The rest of this work is organized as follows: In Section 2 we recall the notion of (time-reversed)
C-mixing processes. We further illustrate this class of processes by some examples and discuss the
relation between C-mixing and other notions of mixing. As the main result of this work, a Bernstein-type
inequality for geometrically (time-reversed) C-mixing processes will be formulated in Section 3. There,
we also compare our new Bernstein-type inequality to previously established concentration inequalities.
As an application of our Bernstein-type inequality, we will derive the oracle inequality for regularized
risk minimization schemes in Section 4. We additionally derive learning rates for SVMs and an oracle
inequality for forecasting certain dynamical systems. All proofs can be found in the last section.

2 C-mixing processes 

In this section we recall two classes of stationary stochastic processes called (time-reversed) C-mixing 
processes that have a certain decay of correlations for suitable pairs of functions. We also present some 
examples of such processes including certain dynamical systems. 

Let us begin by introducing some notation. In the following, $(\Omega, \mathcal{A}, \mu)$ always denotes a probability
space. As usual, we write $L_p(\mu)$ for the space of (equivalence classes of) measurable functions
$f : \Omega \to \mathbb{R}$ with finite $L_p$-norm $\|f\|_p$. It is well-known that $L_p(\mu)$ together with $\|\cdot\|_p$ forms a Banach
space. Moreover, if $\mathcal{A}' \subset \mathcal{A}$ is a sub-$\sigma$-algebra, then $L_p(\mathcal{A}', \mu)$ denotes the space of all
$\mathcal{A}'$-measurable functions $f \in L_p(\mu)$. In the following, for a Banach space $E$, we write $B_E$ for its closed unit ball.

Given a semi-norm $\|\cdot\|$ on a vector space $E$ of bounded measurable functions $f : Z \to \mathbb{R}$, we define
the C-norm by

\[ \|f\|_{\mathcal{C}} := \|f\|_\infty + \|f\| \tag{1} \]

and denote the space of all bounded C-functions by

\[ \mathcal{C}(Z) := \bigl\{ f : Z \to \mathbb{R} \;\big|\; \|f\|_{\mathcal{C}} < \infty \bigr\}. \tag{2} \]

Throughout this work, we only consider the semi-norms $\|\cdot\|$ in (1) that satisfy the inequality

\[ \|e^f\| \le \|e^f\|_\infty \, \|f\| \tag{3} \]

for all $f \in \mathcal{C}(Z)$. We are mostly interested in the following examples of semi-norms satisfying (3).

Example 2.1. Let $Z$ be an arbitrary set and suppose that we have $\|f\| = 0$ for all $f : Z \to \mathbb{R}$. Then, it is
obvious that $\|e^f\| = \|f\| = 0$. Hence, (3) is satisfied.

Example 2.2. Let $Z \subset \mathbb{R}$ be an interval. A function $f : Z \to \mathbb{R}$ is said to have bounded variation on $Z$ if its total
variation $\|f\|_{BV(Z)}$ is bounded. Denote by $BV(Z)$ the set of all functions of bounded variation. It is well-known
that $BV(Z)$ together with $\|f\|_\infty + \|f\|_{BV(Z)}$ forms a Banach space. Moreover, (3) is satisfied, i.e. we have for all
$f \in \mathcal{C}(Z)$:

\[ \|e^f\|_{BV(Z)} \le \|e^f\|_\infty \, \|f\|_{BV(Z)}. \]
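The inequality of Example 2.2 can be checked numerically on a grid. The following is a minimal Python sketch, not part of the original argument: the helper names (`total_variation`, `sup_norm`, `c_norm`) and the test function are illustrative assumptions, and the total variation is only evaluated over the chosen grid points.

```python
import math

def total_variation(values):
    """Variation of a function over an ordered grid: sum of absolute increments."""
    return sum(abs(b - a) for a, b in zip(values, values[1:]))

def sup_norm(values):
    """Supremum norm restricted to the grid."""
    return max(abs(v) for v in values)

def c_norm(values):
    """C-norm (1) with the BV semi-norm: sup norm plus total variation."""
    return sup_norm(values) + total_variation(values)

# Hypothetical test function on [0, 1]: a smooth part plus one jump at 0.4.
grid = [i / 1000 for i in range(1001)]
f = [math.sin(6 * x) + (0.5 if x > 0.4 else 0.0) for x in grid]
exp_f = [math.exp(v) for v in f]

lhs = total_variation(exp_f)                 # BV semi-norm of e^f on the grid
rhs = sup_norm(exp_f) * total_variation(f)   # sup norm of e^f times BV semi-norm of f
print(lhs <= rhs, c_norm(f))
```

On a grid the inequality holds term by term, since $|e^b - e^a| \le \max(e^a, e^b)\,|b - a|$ by the mean value theorem.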

Example 2.3. Let $Z$ be a subset of $\mathbb{R}^d$ and $C_b(Z)$ be the set of bounded continuous functions on $Z$. For
$f \in C_b(Z)$ and $0 < \alpha \le 1$ let

\[ \|f\| := |f|_\alpha := \sup_{z \ne z'} \frac{|f(z) - f(z')|}{|z - z'|^\alpha}. \]

Clearly, $f$ is $\alpha$-Hölder continuous if and only if $|f|_\alpha < \infty$. The collection of bounded, $\alpha$-Hölder continuous
functions on $Z$ will be denoted by

\[ C_{b,\alpha}(Z) := \{ f \in C_b(Z) : |f|_\alpha < \infty \}. \]

Note that, if $Z$ is compact, then $C_{b,\alpha}(Z)$ together with the norm $\|f\|_{C_{b,\alpha}} := \|f\|_\infty + |f|_\alpha$ forms a Banach space.
Moreover, the inequality (3) is also valid for $f \in C_{b,\alpha}(Z)$. As usual, we speak of Lipschitz continuous functions
if $\alpha = 1$ and write $\mathrm{Lip}(Z) := C_{b,1}(Z)$.

Example 2.4. Let $Z \subset \mathbb{R}^d$ be an open subset. For a continuously differentiable function $f : Z \to \mathbb{R}$ we write

\[ \|f\| := \sup_{z \in Z} |f'(z)| \]

and $C^1(Z) := \{ f : Z \to \mathbb{R} \mid f \text{ continuously differentiable and } \|f\|_\infty + \|f\| < \infty \}$. It is well-known that $C^1(Z)$
is a Banach space with respect to the norm $\|\cdot\|_\infty + \|\cdot\|$, and the chain rule gives

\[ \|e^f\| = \|(e^f)'\|_\infty = \|e^f \cdot f'\|_\infty \le \|e^f\|_\infty \|f'\|_\infty = \|e^f\|_\infty \|f\| \]

for all $f \in C^1(Z)$, i.e. (3) is satisfied.


Let us now assume that we also have a measurable space $(Z, \mathcal{B})$ and a measurable map $\chi : \Omega \to Z$.
Then $\sigma(\chi)$ denotes the smallest $\sigma$-algebra on $\Omega$ for which $\chi$ is measurable. Moreover, $\mu_\chi$ denotes the
$\chi$-image measure of $\mu$, which is defined by $\mu_\chi(B) := \mu(\chi^{-1}(B))$, $B \in \mathcal{B}$.

Let $\mathcal{Z} := (Z_n)_{n \ge 0}$ be a $Z$-valued stochastic process on $(\Omega, \mathcal{A}, \mu)$, and $\mathcal{A}_0^i$ and $\mathcal{A}_{i+n}^\infty$ be the $\sigma$-algebras
generated by $(Z_0, \ldots, Z_i)$ and $(Z_{i+n}, Z_{i+n+1}, \ldots)$, respectively. The process $\mathcal{Z}$ is called stationary if
$\mu_{(Z_{i_1+i}, \ldots, Z_{i_n+i})} = \mu_{(Z_{i_1}, \ldots, Z_{i_n})}$ for all $n, i, i_1, \ldots, i_n \ge 1$. In this case, we always write $P := \mu_{Z_0}$.
Moreover, to define certain dependency coefficients for $\mathcal{Z}$, we denote, for $\psi, \varphi \in L_1(\mu)$ satisfying
$\psi\varphi \in L_1(\mu)$, the correlation of $\psi$ and $\varphi$ by

\[ \mathrm{cor}(\psi, \varphi) := \int \psi \varphi \, d\mu - \int \psi \, d\mu \cdot \int \varphi \, d\mu. \]

Several dependency coefficients for $\mathcal{Z}$ can be expressed by imposing restrictions on $\psi$ and $\varphi$. The following
definition, which is taken from [31], introduces the restrictions on $\psi$ and $\varphi$ we consider throughout
this work.
Definition 2.5. Let $(\Omega, \mathcal{A}, \mu)$ be a probability space, $(Z, \mathcal{B})$ be a measurable space, $\mathcal{Z} := (Z_n)_{n \ge 0}$ be
a $Z$-valued, stationary process on $\Omega$, and $\|\cdot\|_{\mathcal{C}}$ be defined by (1) for some semi-norm $\|\cdot\|$. Then, for
$n \ge 0$, we define:

(i) the C-mixing coefficients by

\[ \phi_{\mathcal{C}}(\mathcal{Z}, n) := \sup\bigl\{ \mathrm{cor}(\psi, h \circ Z_{k+n}) : k \ge 0,\ \psi \in B_{L_1(\mathcal{A}_0^k, \mu)},\ h \in B_{\mathcal{C}(Z)} \bigr\}, \tag{4} \]

(ii) the time-reversed C-mixing coefficients by

\[ \phi_{\mathcal{C},\mathrm{rev}}(\mathcal{Z}, n) := \sup\bigl\{ \mathrm{cor}(h \circ Z_k, \varphi) : k \ge 0,\ h \in B_{\mathcal{C}(Z)},\ \varphi \in B_{L_1(\mathcal{A}_{k+n}^\infty, \mu)} \bigr\}. \tag{5} \]

Let $(d_n)_{n \ge 0}$ be a strictly positive sequence converging to 0. Then we say that $\mathcal{Z}$ is (time-reversed)
C-mixing with rate $(d_n)_{n \ge 0}$, if we have $\phi_{\mathcal{C},(\mathrm{rev})}(\mathcal{Z}, n) \le d_n$ for all $n \ge 0$. Moreover, if $(d_n)_{n \ge 0}$ is of the
form

\[ d_n := c \exp(-b n^\gamma), \qquad n \ge 1, \tag{6} \]

for some constants $b > 0$, $c > 0$, and $\gamma > 0$, then $\mathcal{Z}$ is called geometrically (time-reversed) C-mixing.

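Obviously, $\mathcal{Z}$ is C-mixing with rate $(d_n)_{n \ge 0}$, if and only if for all $k, n \ge 0$, all $\psi \in L_1(\mathcal{A}_0^k, \mu)$, and
all $h \in \mathcal{C}(Z)$, we have

\[ \mathrm{cor}(\psi, h \circ Z_{k+n}) \le \|\psi\|_{L_1(\mu)} \|h\|_{\mathcal{C}} \, d_n, \tag{7} \]

or similarly, time-reversed C-mixing with rate $(d_n)_{n \ge 0}$, if and only if for all $k, n \ge 0$, all $h \in \mathcal{C}(Z)$, and
all $\varphi \in L_1(\mathcal{A}_{k+n}^\infty, \mu)$, we have

\[ \mathrm{cor}(h \circ Z_k, \varphi) \le \|h\|_{\mathcal{C}} \|\varphi\|_{L_1(\mu)} \, d_n. \tag{8} \]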

In the rest of this section we consider examples of (time-reversed) C-mixing processes. To begin
with, let us assume that $\mathcal{Z}$ is a stationary $\phi$-mixing process [25] with rate $(d_n)_{n \ge 0}$. By [16, Inequality
(1.1)] we then have

\[ \mathrm{cor}(\psi, \varphi) \le \|\psi\|_{L_1(\mu)} \|\varphi\|_{L_\infty(\mu)} \, d_n, \qquad n \ge 1, \tag{9} \]

for all $\mathcal{A}_0^k$-measurable $\psi \in L_1(\mu)$ and all $\mathcal{A}_{k+n}^\infty$-measurable $\varphi \in L_\infty(\mu)$. By taking $\|\cdot\|_{\mathcal{C}} := \|\cdot\|_\infty$
and $\varphi := h \circ Z_{k+n}$, we then see that (7) is satisfied, i.e. $\mathcal{Z}$ is C-mixing with rate $(d_n)_{n \ge 0}$. Finally,
by similar arguments we can deduce that time-reversed $\phi$-mixing processes [12, Section 3.13] are also
time-reversed C-mixing with the same rate. In other words we have found

\[ \phi_{L_\infty(\mu)}(\mathcal{Z}, n) = \phi(\mathcal{Z}, n) \quad \text{and} \quad \phi_{L_\infty(\mu),\mathrm{rev}}(\mathcal{Z}, n) = \phi_{\mathrm{rev}}(\mathcal{Z}, n). \]

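To deal with processes that are not $\alpha$-mixing [35], Rio [34] introduced the following relaxation of
$\phi$-mixing coefficients

\[ \tilde\phi(\mathcal{Z}, n) := \sup_{k \ge 0,\ f \in BV_1} \bigl\| E\bigl(f(Z_{k+n}) \,\big|\, \mathcal{A}_0^k\bigr) - E f(Z_{k+n}) \bigr\|_\infty \tag{10} \]
\[ \phantom{\tilde\phi(\mathcal{Z}, n) :}= \sup\bigl\{ \mathrm{cor}(\psi, h \circ Z_{k+n}) : k \ge 0,\ \psi \in B_{L_1(\mathcal{A}_0^k, \mu)},\ h \in B_{BV(Z)} \bigr\} \]

and an analogous time-reversed coefficient

\[ \tilde\phi_{\mathrm{rev}}(\mathcal{Z}, n) := \sup_{k \ge 0,\ f \in BV_1} \bigl\| E\bigl(f(Z_k) \,\big|\, \mathcal{A}_{k+n}^\infty\bigr) - E f(Z_k) \bigr\|_\infty \]
\[ \phantom{\tilde\phi_{\mathrm{rev}}(\mathcal{Z}, n) :}= \sup\bigl\{ \mathrm{cor}(h \circ Z_k, \varphi) : k \ge 0,\ \varphi \in B_{L_1(\mathcal{A}_{k+n}^\infty, \mu)},\ h \in B_{BV(Z)} \bigr\}, \]

where the two identities follow from [18, Lemma 4]. In other words we have

\[ \phi_{BV(Z)}(\mathcal{Z}, n) = \tilde\phi(\mathcal{Z}, n) \quad \text{and} \quad \phi_{BV(Z),\mathrm{rev}}(\mathcal{Z}, n) = \tilde\phi_{\mathrm{rev}}(\mathcal{Z}, n). \]

Moreover, [17, p. 41] shows that some uniformly expanding maps are $\tilde\phi$-mixing but not $\alpha$-mixing. Figure
1 summarizes the relations between $\phi$-, $\tilde\phi$-, and C-mixing.

Figure 1: Relationship between $\phi$-, $\tilde\phi$-, and C-mixing processes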

Our next goal is to relate C-mixing to some well-known results on the decay of correlations for
dynamical systems. To this end, recall that $(\Omega, \mathcal{A}, \mu, T)$ is a dynamical system, if $T : \Omega \to \Omega$ is a
measurable map satisfying $\mu(T^{-1}(A)) = \mu(A)$ for all $A \in \mathcal{A}$. Let us consider the stationary stochastic
process $\mathcal{Z} := (Z_n)_{n \ge 0}$ defined by $Z_n := T^n$ for $n \ge 0$. Since $\mathcal{A}_{n+1}^{n+1} \subset \mathcal{A}_n^n$ for all $n \ge 0$, we conclude
that $\mathcal{A}_{k+n}^\infty = \mathcal{A}_{k+n}^{k+n}$. Consequently, $\varphi$ is $\mathcal{A}_{k+n}^\infty$-measurable, if and only if it is $\mathcal{A}_{k+n}^{k+n}$-measurable.
Moreover, $\mathcal{A}_{k+n}^{k+n}$ is the $\sigma$-algebra generated by $T^{k+n}$, and hence $\varphi$ is $\mathcal{A}_{k+n}^{k+n}$-measurable, if and only if it is of
the form $\varphi = g \circ T^{k+n}$ for some suitable, measurable $g : \Omega \to \mathbb{R}$. Let us now suppose that $\|\cdot\|_{\mathcal{C}(\Omega)}$ is
defined by (1) for some semi-norm $\|\cdot\|$. For $h \in \mathcal{C}(\Omega)$ we then find

\[ \mathrm{cor}(h \circ Z_k, \varphi) = \mathrm{cor}(h \circ Z_k, g \circ Z_{k+n}) = \mathrm{cor}(h, g \circ Z_n) \]
\[ = \int_\Omega h \cdot (g \circ T^n) \, d\mu - \int_\Omega h \, d\mu \cdot \int_\Omega g \, d\mu =: \mathrm{cor}_{T,n}(h, g). \]

The next result shows that $\mathcal{Z}$ is time-reversed C-mixing even if we only have generic constants $C(h, g)$
in (8).

Theorem 2.6. Let $(\Omega, \mathcal{A}, \mu, T)$ be a dynamical system and the stochastic process $\mathcal{Z} := (Z_n)_{n \ge 0}$ be
defined by $Z_n := T^n$ for $n \ge 0$. Moreover, let $\|\cdot\|_{\mathcal{C}}$ be defined by (1) for some semi-norm $\|\cdot\|$. Then,
$\mathcal{Z}$ is time-reversed C-mixing with rate $(d_n)_{n \ge 0}$ iff for all $h \in \mathcal{C}(\Omega)$ and all $g \in L_1(\mu)$ there exists a
constant $C(h, g)$ such that

\[ \mathrm{cor}_{T,n}(h, g) \le C(h, g) \, d_n, \qquad n \ge 0. \]

Thus, we see that $\mathcal{Z}$ is time-reversed C-mixing, if $\mathrm{cor}_{T,n}(h, g)$ converges to zero for all $h \in \mathcal{C}(\Omega)$
and $g \in L_1(\mu)$ with a rate that is independent of $h$ and $g$.

For concrete examples, let us first mention that [31] presents some discrete dynamical systems that
are time-reversed geometrically C-mixing, such as Lasota-Yorke maps, uni-modal maps, and piecewise
expanding maps in higher dimension. Here, the involved spaces are either $BV(Z)$ or $\mathrm{Lip}(Z)$.
In dynamical systems where chaos is weak, correlations often decay polynomially, i.e. the
correlations satisfy

\[ |\mathrm{cor}_{T,n}(h, g)| \le C(h, g) \cdot n^{-b}, \qquad n > 0, \tag{11} \]

for some constants $b > 0$ and $C(h, g) \ge 0$ depending on the functions $h$ and $g$. Young [49] developed
a powerful method for studying correlations in systems with weak chaos where correlations decay at a
polynomial rate for bounded $g$ and Hölder continuous $h$. Her method was applied to billiards with slow
mixing rates, such as Bunimovich billiards, see [6, Theorem 3.5]. For example, modulo some logarithmic
factors, [30, 14] obtained (11) with $b = 1$ and $b = 2$ for certain forms of Bunimovich billiards and Hölder
continuous $h$ and $g$. Besides these results, Baladi [?] also compiles a list of “parabolic” or “intermittent”
systems having a polynomial decay.

It is well-known that, if the functions $h$ and $g$ are sufficiently smooth, there exist dynamical systems
where chaos is strong enough such that the correlations decay exponentially fast, that is,

\[ |\mathrm{cor}_{T,n}(h, g)| \le C(h, g) \cdot \exp(-b n^\gamma), \qquad n \ge 0, \tag{12} \]

for some constants $b > 0$, $\gamma > 0$, and $C(h, g) \ge 0$ depending on $h$ and $g$. Again, Baladi [5] has
listed some simple examples of dynamical systems enjoying (12) for analytic $h$ and $g$ such as the angle
doubling map and Arnold’s cat map. Moreover, for continuously differentiable $h$ and $g$, [36, 39]
proved (12) for two closely related classes of systems, more precisely, $C^{1+\varepsilon}$ Anosov or Axiom-A
diffeomorphisms with Gibbs invariant measures and topological Markov chains, which are also known as
subshifts of finite type, see also [11]. These results were then extended by [24, 37] to expanding interval
maps with smooth invariant measures for functions $h$ and $g$ of bounded variation. In the 1990s, similar
results for Hölder continuous $h$ and $g$ were proved for systems with somewhat weaker chaotic behavior
which is characterized by nonuniform hyperbolicity, such as quadratic interval maps, see [48, 27], and
the Hénon map [?], and then extended to chaotic systems with singularities by [28] and specifically to
Sinai billiards in a torus by [48, 13]. For some of these extensions, such as smooth expanding dynamics,
smooth nonuniformly hyperbolic systems, and hyperbolic systems with singularities, we refer to [4] as
well. Recently, for $h$ of bounded variation and bounded $g$, [29] obtained (12) for a class of piecewise
smooth one-dimensional maps with critical points and singularities. Moreover, [?] has deduced (12) for
$h, g \in \mathrm{Lip}(Z)$ and a suitable iterate of Poincaré’s first return map $T$ of a large class of singular hyperbolic
flows.
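The exponential decay in (12) can be made concrete for the angle doubling map. The following minimal Python sketch is an illustration under stated assumptions rather than part of the original text: it takes $T(x) = 2x \bmod 1$ on $[0, 1)$ with Lebesgue measure (which is $T$-invariant), chooses $h(x) = g(x) = x$, and approximates $\mathrm{cor}_{T,n}(h, g)$ by a midpoint rule. The function names and the grid size `m` are hypothetical. For this particular choice a direct computation gives $\mathrm{cor}_{T,n}(h, g) = 2^{-n}/12$, i.e. (12) with $b = \log 2$ and $\gamma = 1$.

```python
def doubling_map_iter(x, n):
    """n-fold iterate of T(x) = 2x mod 1, i.e. (2^n * x) mod 1."""
    return (2 ** n * x) % 1.0

def correlation(h, g, n, m=2 ** 16):
    """Midpoint-rule approximation of cor_{T,n}(h, g) for Lebesgue measure on [0, 1)."""
    xs = [(i + 0.5) / m for i in range(m)]
    hv = [h(x) for x in xs]
    gv = [g(doubling_map_iter(x, n)) for x in xs]
    mean_hg = sum(a * b for a, b in zip(hv, gv)) / m
    return mean_hg - (sum(hv) / m) * (sum(gv) / m)

def identity(x):
    return x

# Correlations for n = 1, ..., 6; each should be close to 2^(-n) / 12.
cors = [correlation(identity, identity, n) for n in range(1, 7)]
for n, c in enumerate(cors, start=1):
    print(n, c, 1.0 / (12 * 2 ** n))
```

Only small $n$ are used because iterating the doubling map in floating point discards one bit of the initial condition per step.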


3 A Bernstein-type inequality 


In this section, we present the key result of this work, a Bernstein-type inequality for stationary
geometrically (time-reversed) C-mixing processes.

Theorem 3.1. Let $\mathcal{Z} := (Z_n)_{n \ge 0}$ be a $Z$-valued stationary geometrically (time-reversed) C-mixing
process on $(\Omega, \mathcal{A}, \mu)$ with rate $(d_n)_{n \ge 0}$ as in (6), $\|\cdot\|_{\mathcal{C}}$ be defined by (1) for some semi-norm $\|\cdot\|$
satisfying (3), and $P := \mu_{Z_0}$. Moreover, let $h \in \mathcal{C}(Z)$ with $E_P h = 0$ and assume that there exist some
$A > 0$, $B > 0$, and $\sigma \ge 0$ such that $\|h\| \le A$, $\|h\|_\infty \le B$, and $E_P h^2 \le \sigma^2$. Then, for all $\varepsilon > 0$ and all

\[ n \ge n_0 := \max\Bigl\{ \min\Bigl\{ m \ge 3 : m^2 \ge \frac{808c(3A + B)}{B} \ \text{and} \ \frac{m}{(\log m)^{2/\gamma}} \ge 4 \Bigr\},\ e^{3/b} \Bigr\}, \tag{13} \]

we have

\[ \mu\Bigl( \Bigl\{ \omega \in \Omega : \frac{1}{n} \sum_{i=1}^n h(Z_i(\omega)) \ge \varepsilon \Bigr\} \Bigr) \le 2 \exp\Bigl( - \frac{n \varepsilon^2}{8 (\log n)^{2/\gamma} (\sigma^2 + \varepsilon B / 3)} \Bigr), \tag{14} \]

or alternatively, for all $n \ge n_0$ and $\tau > 0$, we have

\[ \mu\Bigl( \Bigl\{ \omega \in \Omega : \frac{1}{n} \sum_{i=1}^n h(Z_i(\omega)) \ge \sqrt{\frac{8 (\log n)^{2/\gamma} \sigma^2 \tau}{n}} + \frac{8 (\log n)^{2/\gamma} B \tau}{3n} \Bigr\} \Bigr) \le 2 e^{-\tau}. \tag{15} \]

Note that besides the additional logarithmic factor $4 (\log n)^{2/\gamma}$ and the constant 2 in front of the
exponential, (14) coincides with Bernstein’s classical inequality for i.i.d. processes.

In the remainder of this section, we compare Theorem 3.1 with some other concentration inequalities
for non-i.i.d. processes $\mathcal{Z}$. Here, $\mathcal{Z}$ is real-valued and $h$ is the identity map if not specified otherwise.


Example 3.2. Theorem 2.3 in [?] shows that smooth expanding systems on $[0, 1]$ have exponential decay of
correlations (12). Moreover, if, for such expanding systems, the transformation $T$ is Lipschitz continuous and
satisfies the conditions at the end of Section 4 in [18] and the ergodic measure $\mu$ satisfies [18, condition (4.8)], then
[18, Theorem 2] shows that for all $\varepsilon > 0$ and $n \ge 1$, the left-hand side of (14) is bounded by

\[ \exp\Bigl( - \frac{\varepsilon^2 n}{C} \Bigr), \]

where $C$ is some constant independent of $n$. The same result has been proved in [15, Theorem III.1] as well.
Obviously, this is a Hoeffding-type bound instead of a Bernstein-type one. Hence, it is always larger than ours if
the denominator of the exponent in (14) is smaller than $C$.

Example 3.3. For dynamical systems with exponentially decreasing $\tilde\phi$-coefficients, see [47, condition (3.1)],
[47, Theorem 3.1] provides a Bernstein-type inequality for 1-Lipschitz functions $h : Z \to [-1/2, 1/2]$ w.r.t. some
metric $d$ on $Z$, in which the left-hand side of (14) is bounded by

\[ \exp\Bigl( - \frac{C \varepsilon^2 n}{\sigma^2 + \varepsilon \log f(n)} \Bigr) \tag{16} \]

for some constant $C$ independent of $n$ and $f(n)$ being some function monotonically increasing in $n$. Note that
modulo the logarithmic factor $\log f(n)$ the bound (16) is the same as the one for i.i.d. processes. Moreover, if $f(n)$
grows polynomially, cf. [47, Section 3.3], then (16) has the same asymptotic behaviour as our bound. However,
geometrically C-mixing is weaker than Condition (3.1) in [47]: Indeed, the required exponential form of Condition
(3.1) in [47], i.e.

\[ \sup_{k \ge 0} \tilde\phi\bigl(\mathcal{A}_0^k, Z_{k+n}^{k+2n-1}\bigr) := \sup_{k \ge 0} \sup_{f \in \mathcal{F}_n} \bigl\| E\bigl(f(Z_{k+n}^{k+2n-1}) \,\big|\, \mathcal{A}_0^k\bigr) - E f(Z_{k+n}^{k+2n-1}) \bigr\|_\infty \le c \cdot e^{-bn} \]

for some $c, b > 0$ and all $n \ge 1$, where $Z_{k+n}^{k+2n-1} := (Z_{k+n}, \ldots, Z_{k+2n-1})$ and $\mathcal{F}_n$ is the set of 1-Lipschitz
functions $f : Z^n \to [-1/2, 1/2]$ w.r.t. the metric $d_n(x, y) := \frac{1}{n} \sum_{i=1}^n d(x_i, y_i)$, implies

\[ \sup_{k \ge 0} \sup_{f \in \mathcal{F}} \bigl\| E\bigl(f(Z_{k+n}) \,\big|\, \mathcal{A}_0^k\bigr) - E f(Z_{k+n}) \bigr\|_\infty \le c \cdot n e^{-bn} \le \tilde c \cdot e^{-\tilde b n} \]

for some $\tilde c, \tilde b > 0$ and all $n \ge 1$, where $\mathcal{F}$ is the set of 1-Lipschitz functions $f : Z \to [-1/2, 1/2]$ w.r.t. the metric
$d$. In other words, processes satisfying Condition (3.1) in [47] are $\tilde\phi$-mixing, see (10), which is stronger than
geometrically C-mixing, see again Figure 1. Moreover, our result holds for all $\gamma > 0$, while [47] only considers
the case $\gamma = 1$.


Example 3.4. For an $\alpha$-mixing sequence of centered and bounded random variables satisfying $\alpha(n) \le c \exp(-b n^\gamma)$
for some constants $b > 0$, $c > 0$, and $\gamma > 0$, [33, Theorem 4.3] bounds the left-hand side of (14) by

\[ (1 + 4 e^{-2} c) \exp\Bigl( - \frac{3 \varepsilon^2 n^{(\gamma)}}{6 \sigma^2 + 2 \varepsilon B} \Bigr) \quad \text{with} \quad n^{(\gamma)} \approx n^{\gamma/(\gamma+1)} \tag{17} \]

for all $n \ge 1$ and all $\varepsilon > 0$. In general, this bound and our result are not comparable, since not every
$\alpha$-mixing process satisfies (7) and conversely, not every process satisfying (7) is necessarily $\alpha$-mixing, see Figure 2.
Nevertheless, for $\phi$-mixing processes, it is easily seen that this bound is always worse than ours for a fixed $\gamma > 0$,
if $n$ is large enough.
Figure 2: Relationship between $\alpha$-, $\phi$-, and C-mixing processes


Example 3.5. For an $\alpha$-mixing stationary sequence of centered and bounded random variables satisfying $\alpha(n) \le
\exp(-2cn)$ for some $c > 0$, [32, Theorem 2] bounds the left-hand side of (14) by

\[ \exp\Bigl( - \frac{C \varepsilon^2 n}{v^2 + \varepsilon B (\log n)^2 + n^{-1} B^2} \Bigr), \tag{18} \]

where $C > 0$ is some constant and

\[ v^2 := \sigma^2 + 2 \sum_{2 \le i \le n} |\mathrm{cov}(X_1, X_i)|. \tag{19} \]

By applying the covariance inequality for $\alpha$-mixing processes, see [16, the corollary to Lemma 2.1], we obtain
$v^2 \le C_\delta \|X_1\|_{2+\delta}^2$ for an arbitrary $\delta > 0$ and a constant $C_\delta$ only depending on $\delta$. If the additional $\delta > 0$ is ignored,
(18) has therefore the same asymptotic behavior as our bound. In general, however, the additional $\delta$ does influence
the asymptotic behavior. For example, the oracle inequality we obtain in the next section would be slower by a
factor of $n^\xi$, where $\xi > 0$ is arbitrary, if we used (18) instead. Finally, note that in general the bound (18) and ours
are not comparable, see again Figure 2.

In particular, Inequality (18) can be applied to geometrically $\phi$-mixing processes with $\gamma = 1$. By using the
covariance inequality (1.1) for $\phi$-mixing processes in [16], we can bound $v^2$ defined as in (19) by $C \sigma^2$ with some
constant $C$ independent of $n$. Modulo the term $n^{-1} B^2$ in the denominator, the bound (18) coincides with ours for
geometrically $\phi$-mixing processes with $\gamma = 1$. However, our bound also holds for such processes with $\gamma \in (0, 1)$.




Example 3.6. For stationary, geometrically $\alpha$-mixing Markov chains with centered and bounded random
variables, [1] bounds the left-hand side of (14) by

\[ \exp\Bigl( - \frac{n \varepsilon^2}{\tilde\sigma^2 + \varepsilon B \log n} \Bigr), \tag{20} \]

where $\tilde\sigma^2 = \lim_{n \to \infty} \frac{1}{n} \mathrm{Var} \sum_{i=1}^n X_i$. By a similar argument as in Example 3.5 we obtain

\[ \mathrm{Var} \sum_{i=1}^n X_i \le n \sigma^2 + 2 \sum_{1 \le i < j \le n} |\mathrm{cov}(X_i, X_j)| \le n \sigma^2 + C_\delta \, n \, \|X_1\|_{2+\delta}^2 \]

for an arbitrary $\delta > 0$ and a constant $C_\delta$ depending only on $\delta$. Consequently we conclude that modulo some
arbitrarily small number $\delta > 0$ and the logarithmic factor $\log n$ instead of $(\log n)^2$, the bound (20) coincides with
ours. Again, this bound and our result are not comparable, see Figure 2.

Example 3.7. For stationary, weakly dependent processes of centered and bounded random variables with
$|\mathrm{cov}(X_1, X_n)| \le c \cdot \exp(-bn)$ for some $c, b > 0$ and all $n \ge 1$, [26, Theorem 2.1] bounds the left-hand side
of (14) by

exp (21) 

where $C_1$ is some constant depending on $c$ and $b$, and $C_2$ is some constant depending on $c$, $b$, and $B$. Note that the
denominator in (21) is at least $C_1$, and therefore the bound (21) is more of Hoeffding type.
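To get a feeling for the price of dependence in Theorem 3.1, the following minimal Python sketch compares the right-hand side of (14) with the classical one-sided Bernstein bound $\exp(-n\varepsilon^2 / (2(\sigma^2 + \varepsilon B/3)))$ for i.i.d. data. It assumes that the logarithmic factor in (14) has exponent $2/\gamma$; the function names and the numerical values are illustrative assumptions only. Up to the leading constant 2, the bound (14) is the i.i.d. bound evaluated at the reduced sample size $n / (4 (\log n)^{2/\gamma})$.

```python
import math

def iid_bernstein(n, eps, sigma2, B):
    """Classical one-sided Bernstein bound for n i.i.d. observations."""
    return math.exp(-n * eps ** 2 / (2.0 * (sigma2 + eps * B / 3.0)))

def effective_sample_size(n, gamma):
    """Sample size n reduced by the factor 4 (log n)^(2/gamma) (assumed exponent)."""
    return n / (4.0 * math.log(n) ** (2.0 / gamma))

def mixing_bernstein(n, eps, sigma2, B, gamma):
    """Right-hand side of (14), assuming the logarithmic exponent 2/gamma."""
    log_factor = math.log(n) ** (2.0 / gamma)
    return 2.0 * math.exp(-n * eps ** 2 / (8.0 * log_factor * (sigma2 + eps * B / 3.0)))

# Hypothetical values, chosen only for illustration.
n, eps, sigma2, B, gamma = 10000, 0.1, 0.25, 1.0, 1.0
print(iid_bernstein(n, eps, sigma2, B))
print(mixing_bernstein(n, eps, sigma2, B, gamma))
print(effective_sample_size(n, gamma))
```

Note that (14) is only claimed for $n \ge n_0$ as in (13); the sketch does not check this condition.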
4 Applications to Statistical Learning

In this section, we apply the Bernstein inequality from the last section to deduce oracle inequalities for
some widely used learning methods and observations generated by a geometrically C-mixing process.
More precisely, in Subsection 4.1 we recall some basic concepts of statistical learning and formulate
an oracle inequality for learning methods that are based on (regularized) empirical risk minimization.
Then, in Subsection 4.2 we illustrate this oracle inequality by deriving the learning rates for SVMs.
Finally, in Subsection 4.3 we present an oracle inequality for forecasting of dynamical systems.

4.1 Oracle inequality for CR-ERMs 

In this section, let $X$ always be a measurable space if not mentioned otherwise and $Y \subset \mathbb{R}$ always be a
closed subset. Recall that in (supervised) statistical learning, our aim is to find a function $f : X \to \mathbb{R}$
such that for $(x, y) \in X \times Y$ the value $f(x)$ is a good prediction of $y$ at $x$. To evaluate the quality of
such functions $f$, we need a loss function $L : X \times Y \times \mathbb{R} \to [0, \infty)$ that is measurable. Following [42,
Definition 2.22], we say that a loss $L$ can be clipped at $M > 0$, if, for all $(x, y, t) \in X \times Y \times \mathbb{R}$, we have

\[ L(x, y, \hat t\,) \le L(x, y, t), \tag{22} \]

where $\hat t$ denotes the clipped value of $t$ at $\pm M$, that is $\hat t := t$ if $t \in [-M, M]$, $\hat t := -M$ if $t < -M$,
and $\hat t := M$ if $t > M$. Various often used loss functions can be clipped. For example, if $Y := \{-1, 1\}$ and $L$
is a convex, margin-based loss represented by $\varphi : \mathbb{R} \to [0, \infty)$, that is $L(y, t) = \varphi(yt)$ for all $y \in Y$ and
$t \in \mathbb{R}$, then $L$ can be clipped, if and only if $\varphi$ has a global minimum, see [42, Lemma 2.23]. In particular,
the hinge loss, the least squares loss for classification, and the squared hinge loss can be clipped, but the
logistic loss for classification and the AdaBoost loss cannot be clipped. Moreover, if $Y := [-M, M]$
and $L$ is a convex, distance-based loss represented by some $\psi : \mathbb{R} \to [0, \infty)$, that is $L(y, t) = \psi(y - t)$
for all $y \in Y$ and $t \in \mathbb{R}$, then $L$ can be clipped whenever $\psi(0) = 0$, see again [42, Lemma 2.23]. In
particular, the least squares loss

\[ L(y, t) = (y - t)^2 \tag{23} \]

and the $\tau$-pinball loss

\[ L_\tau(y, t) := \psi(y - t) = \begin{cases} -(1 - \tau)(y - t), & \text{if } y - t < 0, \\ \tau (y - t), & \text{if } y - t \ge 0, \end{cases} \tag{24} \]

used for quantile regression can be clipped, if the space of labels $Y$ is bounded.

Now we summarize assumptions on the loss function L that will be used throughout this work. 

Assumption 4.1. The loss function $L : X \times Y \times \mathbb{R} \to [0, \infty)$ can be clipped at some $M > 0$. Moreover,
it is both bounded in the sense of $L(x, y, t) \le 1$ and locally Lipschitz continuous, that is,

\[ |L(x, y, t) - L(x, y, t')| \le |t - t'|. \tag{25} \]

Here both inequalities are supposed to hold for all $(x, y) \in X \times Y$ and $t, t' \in [-M, M]$. Note that the
former assumption can typically be enforced by scaling.

Given a loss function $L$ and an $f : X \to \mathbb{R}$, we often use the notation $L \circ f$ for the function
$(x, y) \mapsto L(x, y, f(x))$. Our major goal is to have a small average loss for future unseen observations
$(x, y)$. This leads to the following definition, see also [42, Definitions 2.2 & 2.3].
Definition 4.2. Let $L : X \times Y \times \mathbb{R} \to [0, \infty)$ be a loss function and $P$ be a probability measure on
$X \times Y$. Then, for a measurable function $f : X \to \mathbb{R}$ the L-risk is defined by

\[ \mathcal{R}_{L,P}(f) := \int_{X \times Y} L(x, y, f(x)) \, dP(x, y). \]

Moreover, the minimal L-risk

\[ \mathcal{R}^*_{L,P} := \inf\{ \mathcal{R}_{L,P}(f) \mid f : X \to \mathbb{R} \text{ measurable} \} \]

is called the Bayes risk with respect to $P$ and $L$. In addition, a measurable function $f^*_{L,P} : X \to \mathbb{R}$
satisfying $\mathcal{R}_{L,P}(f^*_{L,P}) = \mathcal{R}^*_{L,P}$ is called a Bayes decision function.

Informally, the goal of learning from a training set $D \in (X \times Y)^n$ is to find a decision function $f_D$
such that $\mathcal{R}_{L,P}(f_D)$ is close to the minimal risk $\mathcal{R}^*_{L,P}$. Our next goal is to formalize this idea. We begin
with the following definition.

Definition 4.3. Let $X$ be a set and $Y \subset \mathbb{R}$ be a closed subset. A learning method $\mathcal{L}$ on $X \times Y$ maps
every set $D \in (X \times Y)^n$, $n \ge 1$, to a function $f_D : X \to \mathbb{R}$.

Let us now describe the learning algorithms we are interested in. To this end, we assume that we
have a hypothesis set $\mathcal{F}$ consisting of bounded measurable functions $f : X \to \mathbb{R}$, which is pre-compact
with respect to the supremum norm $\|\cdot\|_\infty$. Since $\mathcal{F}$ can be infinite, we need to recall the following
classical concept, which will enable us to approximate infinite $\mathcal{F}$ by finite subsets.

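Definition 4.4. Let $(T, d)$ be a metric space and $\varepsilon > 0$. We call $S \subset T$ an $\varepsilon$-net of $T$ if for all $t \in T$
there exists an $s \in S$ with $d(s, t) \le \varepsilon$. Moreover, the $\varepsilon$-covering number of $T$ is defined by

\[ \mathcal{N}(T, d, \varepsilon) := \inf\Bigl\{ n \ge 1 : \exists\, s_1, \ldots, s_n \in T \text{ such that } T \subset \bigcup_{i=1}^n B_d(s_i, \varepsilon) \Bigr\}, \]

where $\inf \emptyset := \infty$ and $B_d(s, \varepsilon) := \{ t \in T : d(t, s) \le \varepsilon \}$ denotes the closed ball with center $s \in T$ and
radius $\varepsilon$.

Note that our hypothesis set $\mathcal{F}$ is assumed to be pre-compact, and hence for all $\varepsilon > 0$, the covering
number $\mathcal{N}(\mathcal{F}, \|\cdot\|_\infty, \varepsilon)$ is finite.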
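For a finite set of functions, an $\varepsilon$-net in the sense of Definition 4.4 can be built greedily, and its size is an upper bound on the covering number. The following minimal Python sketch illustrates this under stated assumptions: functions are represented by their values on a fixed grid, the supremum norm is taken over that grid only, and the family of linear functions as well as all names are hypothetical.

```python
def sup_distance(u, v):
    """Supremum-norm distance between two functions given by their grid values."""
    return max(abs(a - b) for a, b in zip(u, v))

def greedy_net(points, eps, dist=sup_distance):
    """Greedily select centers so that every point lies within eps of some center."""
    centers = []
    for p in points:
        # p becomes a new center only if no existing center covers it.
        if all(dist(p, c) > eps for c in centers):
            centers.append(p)
    return centers

# Hypothetical finite hypothesis set: x -> a * x on [0, 1] with slopes a in [-1, 1].
grid = [i / 50 for i in range(51)]
slopes = [k / 10 for k in range(-10, 11)]
functions = [[a * x for x in grid] for a in slopes]

eps = 0.25
net = greedy_net(functions, eps)
print(len(net), "centers for", len(functions), "functions")
```

The greedy net is in general not minimal, so its cardinality only bounds $\mathcal{N}(T, d, \varepsilon)$ from above.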

In order to introduce our generic learning algorithms, we write

\[ D := \bigl( (X_1, Y_1), \ldots, (X_n, Y_n) \bigr) := (Z_1, \ldots, Z_n) \in (X \times Y)^n \]

for a training set of length $n$ that is distributed according to the first $n$ components of the $X \times Y$-valued
process $\mathcal{Z} = (Z_i)_{i \ge 1}$. Furthermore, we write $D_n := \frac{1}{n} \sum_{i=1}^n \delta_{(X_i, Y_i)}$, where $\delta_{(X_i, Y_i)}$ denotes the
(random) Dirac measure at $(X_i, Y_i)$. In other words, $D_n$ is the empirical measure associated to the data
set $D$. Finally, the risk of a function $f : X \to \mathbb{R}$ with respect to this measure

\[ \mathcal{R}_{L,D_n}(f) = \frac{1}{n} \sum_{i=1}^n L(X_i, Y_i, f(X_i)) \]

is called the empirical L-risk.

With these preparations we can now introduce the class of learning methods we are interested in, see
also [42, Definition 7.18].

Definition 4.5. Let $L : X \times Y \times \mathbb{R} \to [0, \infty)$ be a loss that can be clipped at some $M > 0$, $\mathcal{F}$ be a
hypothesis set, that is, a set of measurable functions $f : X \to \mathbb{R}$, with $0 \in \mathcal{F}$, and $\Upsilon$ be a regularizer
on $\mathcal{F}$, that is, a function $\Upsilon : \mathcal{F} \to [0, \infty)$ with $\Upsilon(0) = 0$. Then, for $\delta \ge 0$, a learning method whose
decision functions $f_{D_n,\Upsilon} \in \mathcal{F}$ satisfy

\[ \Upsilon(f_{D_n,\Upsilon}) + \mathcal{R}_{L,D_n}(\hat f_{D_n,\Upsilon}) \le \inf_{f \in \mathcal{F}} \bigl( \Upsilon(f) + \mathcal{R}_{L,D_n}(f) \bigr) + \delta \tag{26} \]

for all $n \ge 1$ and $D_n \in (X \times Y)^n$ is called $\delta$-approximate clipped regularized empirical risk
minimization ($\delta$-CR-ERM) with respect to $L$, $\mathcal{F}$, and $\Upsilon$.

Moreover, in the case $\delta = 0$, we simply speak of clipped regularized empirical risk minimization
(CR-ERM).

Note that on the right-hand side of (26) the unclipped loss is considered, and hence CR-ERMs do
not necessarily minimize the regularized clipped empirical risk $\Upsilon(f) + \mathcal{R}_{L,D_n}(\hat f)$. Moreover, in general
CR-ERMs do not minimize the regularized risk $\Upsilon(f) + \mathcal{R}_{L,D_n}(f)$ either, because on the left-hand side of
(26) the clipped function is considered. However, if we have a minimizer of the unclipped regularized
risk, then it automatically satisfies (26). As an example of CR-ERMs, SVMs will be discussed in Section
4.2.
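The last remark, that a minimizer of the unclipped regularized empirical risk automatically satisfies (26) with $\delta = 0$, can be illustrated on a toy problem. The following minimal Python sketch is an assumption-laden illustration, not the SVM setting of the paper: it uses the least squares loss, a hypothetical finite hypothesis set of linear functions $x \mapsto ax$ (containing the zero function), the hypothetical regularizer $\Upsilon(f_a) = \lambda a^2$, and made-up data with labels in $[-M, M]$.

```python
def clip(t, M):
    """Clipped value of t at +/-M."""
    return max(-M, min(M, t))

def least_squares(y, t):
    return (y - t) ** 2

def empirical_risk(f, data):
    """Empirical least squares risk of f on the sample."""
    return sum(least_squares(y, f(x)) for x, y in data) / len(data)

M = 1.0
lam = 0.1  # hypothetical regularization parameter
data = [(-2.0, -1.0), (-1.0, -0.6), (0.0, 0.1), (1.0, 0.5), (2.0, 1.0)]
slopes = [-1.0, -0.5, 0.0, 0.5, 1.0]  # hypothetical hypothesis set; slope 0 is the zero function

def regularized_risk(a):
    """Unclipped regularized empirical risk of x -> a * x."""
    return lam * a ** 2 + empirical_risk(lambda x: a * x, data)

# Minimizer of the unclipped regularized empirical risk over the finite set.
best = min(slopes, key=regularized_risk)

# Left- and right-hand side of (26) with delta = 0.
lhs = lam * best ** 2 + empirical_risk(lambda x: clip(best * x, M), data)
rhs = min(regularized_risk(a) for a in slopes)
print(best, lhs, rhs)
```

The inequality `lhs <= rhs` follows from (22): clipping the selected function can only decrease its empirical loss when all labels lie in $[-M, M]$.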

Before we present the oracle inequality for $\delta$-CR-ERMs, we need to introduce some more notation.
Let $\mathcal{F}$ be a hypothesis set in the sense of Definition 4.5. For

\[ r^* := \inf_{f \in \mathcal{F}} \Upsilon(f) + \mathcal{R}_{L,P}(f) - \mathcal{R}^*_{L,P} \tag{27} \]

and $r > r^*$, we write

\[ \mathcal{F}_r := \{ f \in \mathcal{F} : \Upsilon(f) + \mathcal{R}_{L,P}(f) - \mathcal{R}^*_{L,P} \le r \}. \tag{28} \]

Then we have $r^* \le 1$, since $L(x, y, 0) \le 1$, $0 \in \mathcal{F}$, and $\Upsilon(0) = 0$. Furthermore, we assume that we
have a monotonic decreasing sequence $(A_r)_{r \in (0,1]}$ such that

\[ \|L \circ f\| \le A_r \quad \text{for all } f \in \mathcal{F}_r \text{ and } r \in (0, 1], \tag{29} \]

where $\|\cdot\|$ is a semi-norm satisfying (3). Because of the definition (28), it is easy to conclude that
$\|L \circ f\| \le A_1$ for all $f \in \mathcal{F}_r$ and $r \in (0, 1]$. Finally, we assume that there exists a function $\varphi : (0, \infty) \to
(0, \infty)$ and a $p \in (0, 1]$ such that, for all $r > 0$ and $\varepsilon > 0$, we have

\[ \ln \mathcal{N}(\mathcal{F}_r, \|\cdot\|_\infty, \varepsilon) \le \varphi(\varepsilon) \, r^p. \tag{30} \]

Note that there are actually many hypothesis sets satisfying Assumption (30), see [23, Section 4] for
some examples.

Now the oracle inequality for CR-ERMs reads as follows: 

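Theorem 4.6. Let $\mathcal Z := (Z_n)_{n \ge 0}$ be a $Z$-valued stationary geometrically (time-reversed) $\mathcal C$-mixing
process on $(\Omega, \mathcal A, \mu)$ with rate $(d_n)_{n \ge 0}$ as in (13), let $\|\cdot\|_{\mathcal C}$ be the $\mathcal C$-norm associated with some semi-norm $\|\cdot\|$
satisfying (5), and let $P := \mu_{Z_0}$. Moreover, let $L$ be a loss satisfying Assumption 4.1. In addition, assume
that there exist a Bayes decision function $f^*_{L,P}$ and constants $\vartheta \in [0, 1]$ and $V \ge 1$ such that

\[ \mathbb E_P (L \circ \wideparen{f} - L \circ f^*_{L,P})^2 \le V \cdot \bigl( \mathbb E_P (L \circ \wideparen{f} - L \circ f^*_{L,P}) \bigr)^{\vartheta}, \qquad f \in \mathcal F, \tag{31} \]

where $\mathcal F$ is a hypothesis set with $0 \in \mathcal F$. We define $r^*$, $\mathcal F_r$, and $A_r$ by (27), (28), and (29), respectively,
and assume that (30) is satisfied. Finally, let $\Upsilon : \mathcal F \to [0, \infty)$ be a regularizer with $\Upsilon(0) = 0$, $f_0 \in \mathcal F$
be a fixed function, and $A_0, A^* \ge 0$, $B_0 \ge 1$ be constants such that $\|L \circ f_0\| \le A_0$, $\|L \circ \wideparen{f}_0\| \le A_0$,
$\|L \circ f^*_{L,P}\| \le A^*$ and $\|L \circ f_0\|_\infty \le B_0$. Then, for all fixed $\varepsilon > 0$, $\delta \ge 0$, $\tau \ge 1$, and

\[ n \ge n_0 := \max\Bigl\{ \min\Bigl\{ m \ge 3 : m^2 \ge K \text{ and } \frac{m}{(\log m)^{2/\gamma}} \ge 4 \Bigr\},\ e^{3/b} \Bigr\} \tag{32} \]

with $K = 1212c(4A_0 + A^* + A_1 + 1)$, and $r \in (0, 1]$ satisfying

\[ r \ge \max\Bigl\{ \Bigl( \frac{c_V (\log n)^{2/\gamma} (\tau + \varphi(\varepsilon/2) 2^p r^p)}{n} \Bigr)^{\frac{1}{2 - \vartheta}},\ \frac{20 (\log n)^{2/\gamma} B_0 \tau}{n},\ r^* \Bigr\} \tag{33} \]

with $c_V := 512(12V + 1)/3$, every learning method defined by (26) satisfies with probability $\mu$ not less than
$1 - 16e^{-\tau}$:

\[ \Upsilon(f_{D_n,\Upsilon}) + \mathcal R_{L,P}(\wideparen{f}_{D_n,\Upsilon}) - \mathcal R^*_{L,P} < 2\Upsilon(f_0) + 4\mathcal R_{L,P}(f_0) - 4\mathcal R^*_{L,P} + 4r + 5\varepsilon + 2\delta. \tag{34} \]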



Let us briefly discuss the variance bound (31). For example, if $Y = [-M, M]$ and $L$ is the least
squares loss, then it is well-known that (31) is satisfied for $V := 16M^2$ and $\vartheta = 1$, see e.g. [42, Example
7.3]. Moreover, under some assumptions on the distribution $P$, [43] established a variance bound of
the form (31) for the pinball loss used for quantile regression. In addition, for the hinge loss, (31) is
satisfied for $\vartheta := q/(q + 1)$, if Tsybakov's noise assumption [46] holds for $q$, see [42, Theorem 8.24].
Finally, based on j9j, PlOl established a variance bound with id = 1 for the earlier mentioned clippable 
modifications of strictly convex, twice continuously differentiable margin-based loss functions. 

One might wonder why the constants $A_0$ and $B_0$ are necessary in Theorem 4.6, since they appear to
add further complexity. However, a closer look reveals that the constants $A_1$ and $B$ are the bounds for
functions of the form $L \circ \wideparen{f}$, while $A_0$ and $B_0$ are valid for the function $L \circ f_0$ for an unclipped $f_0 \in \mathcal F$.
Since we do not assume that all $f \in \mathcal F$ satisfy $\wideparen{f} = f$, we conclude that in general $A_0$ and $B_0$ are
necessary.

The following lemma shows that the required bounds on ||L o / || do hold for specific loss functions, 
if C = Lip and the involved functions / € B are Lipschitz, too. 

Lemma 4.7. Let $(X, d)$ be a metric space, $Y \subset [-M, M]$ with $M > 0$. Moreover, let $f : X \to \mathbb R$ be a
bounded, Lipschitz continuous function. Then the following statements hold true:

(i) For the least squares loss $L$, see (23), we have

\[ |L \circ f|_1 \le 2\sqrt{2}\, (M + \|f\|_\infty)(1 + |f|_1). \]

(ii) For the $\tau$-pinball loss $L$, see (24), we have

\[ |L \circ f|_1 \le \sqrt{2}\, (1 + |f|_1). \]


4.2 Learning rates for SVMs 

Let us begin by briefly recalling SVMs, see [42] for details. To this end, let $X$ be a measurable space,
$Y := [-1, 1]$ and $k$ be a measurable (reproducing) kernel on $X$ with reproducing kernel Hilbert space
(RKHS) $H$. Given a regularization parameter $\lambda > 0$ and a convex loss $L$, SVMs find the unique solution

\[ f_{D_n,\lambda} = \arg\min_{f \in H} \bigl( \lambda \|f\|_H^2 + \mathcal R_{L,D_n}(f) \bigr). \tag{35} \]

In particular, SVMs using the least-squares loss (23) are called least-squares SVMs (LS-SVMs), while
SVMs using the $\tau$-pinball loss (24) are called SVMs for quantile regression.

Note that SVM decision functions (35) satisfy (26) for the regularizer $\Upsilon := \lambda \|\cdot\|_H^2$ and $\delta := 0$. In
other words, SVMs are CR-ERMs. Consequently we can use the oracle inequality in Theorem 4.6 to
derive the learning rates for SVMs.

Assumption 4.1 implies that

\[ \lambda \|f_{D_n,\lambda}\|_H^2 \le \lambda \|f_{D_n,\lambda}\|_H^2 + \mathcal R_{L,D_n}(f_{D_n,\lambda}) = \min_{f \in H} \bigl( \lambda \|f\|_H^2 + \mathcal R_{L,D_n}(f) \bigr) \le \mathcal R_{L,D_n}(0) \le 1. \]

In other words, for a fixed $\lambda > 0$, we have

\[ f_{D_n,\lambda} \in \lambda^{-1/2} B_H, \tag{36} \]

where $B_H$ denotes the closed unit ball of the RKHS $H$.

In the following, we are mainly interested in the commonly used Gaussian RBF kernels $k_\sigma : X \times
X \to \mathbb R$ defined by

\[ k_\sigma(x, x') := \exp\Bigl( -\frac{\|x - x'\|_2^2}{\sigma^2} \Bigr), \qquad x, x' \in X, \]

where $X \subset \mathbb R^d$ is a nonempty subset and $\sigma > 0$ is a free parameter called the width. We write $H_\sigma$ for the
corresponding RKHSs, which are described in some detail in [44]. The entropy numbers for Gaussian
kernels [42, Theorem 6.27] and the equivalence of covering and entropy numbers [42, Lemma 6.21]
yield that

\[ \ln \mathcal N(B_{H_\sigma}, \|\cdot\|_\infty, \varepsilon) \le a \sigma^{-d} \varepsilon^{-2p}, \qquad \varepsilon > 0, \tag{37} \]

for some constants $a > 0$ and $p \in (0, 1)$.

Because of (36), we can choose the hypothesis set as $\mathcal F = \lambda^{-1/2} B_{H_\sigma}$. Then the definition (28)
implies that $\mathcal F_r \subset r^{1/2} \lambda^{-1/2} B_{H_\sigma}$ and consequently we have

\[ \ln \mathcal N(\mathcal F_r, \|\cdot\|_\infty, \varepsilon) \le a \sigma^{-d} \lambda^{-p} \varepsilon^{-2p} r^p, \]

and thus, for the function $\varphi$ in Theorem 4.6 we can choose

\[ \varphi(\varepsilon) := a \sigma^{-d} \lambda^{-p} \varepsilon^{-2p}. \tag{38} \]

Now, with some additional assumptions below, we can use the oracle inequality in Theorem 4.6 to
derive the learning rates for the SVMs using Gaussian kernels. In the following, $B^t_{2s,\infty}$ denotes the usual
Besov space with the smoothness parameter $t$; for more details see [21, Section 2].

Theorem 4.8 (Least Squares Regression with Gaussian Kernels). Let $Y := [-M, M]$ for $M > 0$, and
$P$ be a distribution on $\mathbb R^d \times Y$ such that $X := \operatorname{supp} P_X \subset B_{\ell_2^d}$ is a bounded domain with $\mu(\partial X) = 0$,
where $B_{\ell_2^d}$ denotes the closed unit ball of $d$-dimensional Euclidean space $\ell_2^d$. Furthermore, let $P_X$ be
absolutely continuous w.r.t. the Lebesgue measure $\mu$ on $X$ with associated density $g : \mathbb R^d \to \mathbb R$ such
that $g \in L_q(X)$ for some $q \ge 1$. Moreover, let $f^*_{L,P} : \mathbb R^d \to \mathbb R$ be a Bayes decision function such that
$f^*_{L,P} \in L_2(\mathbb R^d) \cap \mathrm{Lip}(\mathbb R^d)$ as well as $f^*_{L,P} \in B^t_{2s,\infty}$ for some $t \ge 1$ and $s \ge 1$ with $\frac1q + \frac1s = 1$. Then, for
all $\xi > 0$, the LS-SVM using Gaussian RKHS $H_\sigma$ and

\[ \lambda_n = n^{-1} \quad \text{and} \quad \sigma_n = n^{-\frac{1}{2t + d}}, \tag{39} \]

learns with rate

\[ n^{-\frac{2t}{2t + d} + \xi}. \tag{40} \]

It turns out that, modulo the arbitrarily small $\xi > 0$, these learning rates are optimal, see e.g. [45,
Theorem 13] or [22, Theorem 3.2].
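As a small numerical illustration of the parameter choice (39) and the rate (40), the following Python sketch evaluates both for given n, t and d. The function name `ls_svm_schedule` is a hypothetical helper introduced only for this example, and the constants hidden in the notion of a learning rate are ignored.

```python
def ls_svm_schedule(n, t, d, xi=0.0):
    """Return (lambda_n, sigma_n, rate) following (39) and (40).

    n  : sample size
    t  : smoothness parameter of the Besov space
    d  : input dimension
    xi : the arbitrarily small extra term in the exponent of (40)
    """
    lam = n ** -1.0
    sigma = n ** (-1.0 / (2.0 * t + d))
    rate = n ** (-2.0 * t / (2.0 * t + d) + xi)
    return lam, sigma, rate


# Example: n = 10000 samples, smoothness t = 2, dimension d = 1.
lam_n, sigma_n, rate_n = ls_svm_schedule(10000, 2.0, 1)
```

For fixed t the exponent 2t/(2t + d) decreases as d grows, so the same sample size yields a slower rate in higher dimensions.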

To achieve these rates, however, we need to set $\lambda_n$ and $\sigma_n$ as in (39), which in turn requires us
to know $t$. Since in practice we usually know neither this value nor whether it exists, we can use the
training/validation approach TV-SVM, see e.g. [42, Chapters 6.5, 7.4, 8.2], to achieve the same rates
adaptively, i.e. without knowing $t$. To this end, let $\Lambda := (\Lambda_n)$ and $\Sigma := (\Sigma_n)$ be sequences of finite
subsets $\Lambda_n, \Sigma_n \subset (0, 1]$ such that $\Lambda_n$ is an $\epsilon_n$-net of $(0, 1]$ and $\Sigma_n$ is a $\delta_n$-net of $(0, 1]$ with $\epsilon_n \le n^{-1}$
and $\delta_n \le n^{-\frac{1}{2 + d}}$. Furthermore, assume that the cardinalities $|\Lambda_n|$ and $|\Sigma_n|$ grow polynomially in $n$. For
a data set $D := ((x_1, y_1), \dots, (x_n, y_n))$, we define

\[ D_1 := ((x_1, y_1), \dots, (x_m, y_m)), \qquad D_2 := ((x_{m+1}, y_{m+1}), \dots, (x_n, y_n)), \]

where $m := \lfloor n/2 \rfloor + 1$ and $n \ge 4$. We will use $D_1$ as a training set by computing the SVM decision
functions

\[ f_{D_1,\lambda,\sigma} := \arg\min_{f \in H_\sigma} \lambda \|f\|_{H_\sigma}^2 + \mathcal R_{L,D_1}(f), \qquad (\lambda, \sigma) \in \Lambda_n \times \Sigma_n, \]

and use $D_2$ to determine $(\lambda, \sigma)$ by choosing a $(\lambda_{D_2}, \sigma_{D_2}) \in \Lambda_n \times \Sigma_n$ such that

\[ \mathcal R_{L,D_2}(\wideparen{f}_{D_1,\lambda_{D_2},\sigma_{D_2}}) = \min_{(\lambda,\sigma) \in \Lambda_n \times \Sigma_n} \mathcal R_{L,D_2}(\wideparen{f}_{D_1,\lambda,\sigma}). \]

Then, analogous to the proof of Theorem 3.3 in [21], we can show that for all $\zeta > 0$ and $\xi > 0$, the
TV-SVM producing the decision functions $f_{D_1,\lambda_{D_2},\sigma_{D_2}}$ learns with the above learning rates (40).
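The training/validation procedure can be sketched in a few lines of Python. This is a simplified illustration, not the construction analysed above: inputs are scalar, the candidate sets are small hand-picked grids rather than the nets $\Lambda_n$ and $\Sigma_n$, and the LS-SVM is solved through the linear system $(K + m\lambda I)\alpha = y$ that characterizes the least squares minimizer. All function names are hypothetical.

```python
import math


def gauss_kernel(x, xp, sigma):
    """Gaussian RBF kernel exp(-|x - x'|^2 / sigma^2) for scalar inputs."""
    return math.exp(-((x - xp) ** 2) / sigma ** 2)


def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        for r in range(c + 1, n):
            factor = aug[r][c] / aug[c][c]
            for j in range(c, n + 1):
                aug[r][j] -= factor * aug[c][j]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][j] * z[j] for j in range(r + 1, n))
        z[r] = (aug[r][n] - tail) / aug[r][r]
    return z


def fit_ls_svm(train, lam, sigma):
    """Minimize lam * ||f||_H^2 + mean squared error over the Gaussian RKHS."""
    xs = [x for x, _ in train]
    ys = [y for _, y in train]
    m = len(xs)
    gram = [[gauss_kernel(xs[i], xs[j], sigma) + (m * lam if i == j else 0.0)
             for j in range(m)] for i in range(m)]
    alpha = solve(gram, ys)
    return lambda x: sum(a * gauss_kernel(xi, x, sigma) for a, xi in zip(alpha, xs))


def tv_svm(data, lambdas, sigmas, M):
    """Train on D_1, then pick (lambda, sigma) by the clipped risk on D_2."""
    n = len(data)
    m = n // 2 + 1
    d1, d2 = data[:m], data[m:]
    best = None
    for lam in lambdas:
        for sigma in sigmas:
            f = fit_ls_svm(d1, lam, sigma)
            risk = sum((y - max(-M, min(M, f(x)))) ** 2 for x, y in d2) / len(d2)
            if best is None or risk < best[0]:
                best = (risk, lam, sigma, f)
    return best


# Deterministic toy data with labels in [-1, 1] (an assumption for illustration).
xs = [(0.618 * i) % 1.0 for i in range(40)]
data = [(x, 0.9 * math.sin(2 * math.pi * x)) for x in xs]
risk, lam, sigma, f = tv_svm(data, [1e-3, 1e-2, 1e-1], [0.2, 0.5, 1.0], M=1.0)
```

The validation risk is computed for the clipped decision function, as in the selection rule above, while training uses the unclipped loss.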

The following remark discusses learning rates for SVMs for quantile regression. For more informa¬ 
tion on such SVMs we refer to iCTR Section 4]. 

Remark 4.9 (Quantile Regression with Gaussian Kernels). Let $Y := [-1, 1]$, and $P$ be a distribution on
$\mathbb R^d \times Y$ such that $X := \operatorname{supp} P_X \subset B_{\ell_2^d}$ is a domain. Furthermore, we assume that, for $P_X$-almost all
$x \in X$, the conditional measure $P(\cdot|x)$ is absolutely continuous w.r.t. the Lebesgue measure on $Y$ and
the conditional density $h(\cdot, x)$ of $P(\cdot|x)$ is bounded away from $0$ and $\infty$, see also [21, Example 4.5]. Moreover,
let $P_X$ be absolutely continuous w.r.t. the Lebesgue measure on $X$ with associated density $g \in L_u(X)$
for some $u \ge 1$. For $\tau \in (0, 1)$, let $f^*_{\tau,P} : \mathbb R^d \to \mathbb R$ be a conditional $\tau$-quantile function that satisfies
$f^*_{\tau,P} \in L_2(\mathbb R^d) \cap \mathrm{Lip}(\mathbb R^d)$. In addition, we assume that $f^*_{\tau,P} \in B^t_{2s,\infty}$ for some $t \ge 1$ and $s \ge 1$ such
that $\frac1s + \frac1u = 1$. Then [43, Theorem 2.8] yields a variance bound of the form

\[ \mathbb E_P (L_\tau \circ \wideparen{f} - L_\tau \circ f^*_{\tau,P})^2 \le V \cdot \mathbb E_P (L_\tau \circ \wideparen{f} - L_\tau \circ f^*_{\tau,P}) \]

for all $f : X \to \mathbb R$, where $V$ is a suitable constant and $L_\tau$ is the $\tau$-pinball loss. Arguments similar to those for
Theorem 4.8 show that the essentially optimal learning rate (40) can be achieved as well. Note that the
rate (40) is for the excess $L_\tau$-risk, but since [43, Theorem 2.7] shows

\[ \| \wideparen{f} - f^*_{\tau,P} \|^2_{L_2(P_X)} \le c \bigl( \mathcal R_{L_\tau,P}(\wideparen{f}) - \mathcal R^*_{L_\tau,P} \bigr) \]

for some constant $c > 0$ and all $f : X \to \mathbb R$, we actually obtain the same rates for $\| \wideparen{f} - f^*_{\tau,P} \|^2_{L_2(P_X)}$.
Last but not least, optimality and adaptivity can be discussed along the lines of LS-SVMs.

4.3 Forecasting of dynamical systems

In this section, we proceed with the study of the forecasting problem of dynamical systems considered
in [41]. First, let us recall some basic notations and assumptions. Let $\Omega$ be a compact subset of $\mathbb R^d$,
$(\Omega, \mathcal A, \mu, T)$ be a dynamical system, and $S_0 \in \Omega$ be a random variable describing the true but unknown
state at time 0. Moreover, for $E > 0$, assume that all observations of the stochastic process described
by the sequence $\mathcal T := (T^n)_{n \ge 0}$ are additively corrupted by some i.i.d., $[-E, E]^d$-valued noise process
$\varepsilon = (\varepsilon_n)_{n \ge 0}$ defined on the probability space $(\Theta, \mathcal C, \nu)$ which is (stochastically) independent of $\mathcal T$. It
follows that all possible observations of the system at time $n \ge 0$ are of the form

\[ X_n = T^n(S_0) + \varepsilon_n. \tag{41} \]

In other words, the process that generates the noisy observations (41) is $(T^n(S_0) + \varepsilon_n)_{n \ge 0}$. In particular,
a sequence of observations $(X_0, \dots, X_n)$ generated by this process is of the form (41) for a common
initial state $S_0$.

Now, given an observation of the process $\mathcal T := (T^n)_{n \ge 0}$ at some arbitrary time, our goal is to forecast
the next observable state. To do so, we will use the training set

\[ D_n = ((X_0, X_1), \dots, (X_{n-1}, X_n)) = \bigl( (S_0 + \varepsilon_0, T(S_0) + \varepsilon_1), \dots, (T^{n-1}(S_0) + \varepsilon_{n-1}, T^n(S_0) + \varepsilon_n) \bigr) \]

whose input/output pairs are consecutive observable states. In other words, our goal is to use $D_n$ to build
a forecaster

\[ \bar f_{D_n} : \mathbb R^d \to \mathbb R^d \]

whose average forecasting error on future noisy observations is as small as possible. In order to
achieve this goal, we will use the forecaster

\[ \bar f_{D_n} := \bigl( f_{D_n^{(1)}}, \dots, f_{D_n^{(d)}} \bigr), \tag{42} \]

where $f_{D_n^{(j)}}$ is the forecaster obtained by using the training set

\[ D_n^{(j)} := \bigl( (X_0, \pi_j(X_1)), \dots, (X_{n-1}, \pi_j(X_n)) \bigr), \]

which is obtained by projecting the output variable of $D_n$ onto its $j$-th coordinate via the coordinate
projection $\pi_j : \mathbb R^d \to \mathbb R$.

In other words, we build the forecaster $\bar f_{D_n}$ by training separately $d$ different decision functions
on the training sets $D_n^{(1)}, \dots, D_n^{(d)}$. These problems can be considered as the (supervised) statistical
learning problems formulated in Subsection 4.1 with the help of the following notations.
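The construction of $D_n$ and of the coordinate-wise training sets can be sketched as follows in Python. The helper names `make_training_sets` and `combine` are hypothetical, observations are represented as tuples, and coordinates are indexed from 0 in the code rather than from 1 as in the text.

```python
def make_training_sets(observations):
    """Build D_n and the d coordinate-wise sets from X_0, ..., X_n.

    observations: list of d-dimensional tuples (noisy states).
    Returns (pairs, per_coordinate) where pairs[i] = (X_i, X_{i+1}) and
    per_coordinate[j][i] = (X_i, j-th coordinate of X_{i+1}).
    """
    pairs = list(zip(observations[:-1], observations[1:]))
    d = len(observations[0])
    per_coordinate = [[(x, x_next[j]) for x, x_next in pairs] for j in range(d)]
    return pairs, per_coordinate


def combine(forecasters):
    """Stack d scalar-valued forecasters into one R^d-valued forecaster."""
    return lambda x: tuple(f(x) for f in forecasters)


# Toy trajectory in R^2 (made-up numbers, for illustration only).
observations = [(0.0, 1.0), (0.5, 2.0), (1.0, 3.0)]
pairs, per_coordinate = make_training_sets(observations)

# Two hand-written scalar forecasters standing in for trained decision functions.
forecaster = combine([lambda x: x[0] + 0.5, lambda x: x[1] + 1.0])
```

Each entry of `per_coordinate` is an ordinary scalar regression data set, so any of the learning methods discussed earlier could be applied to it separately.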

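For $E > 0$ and a fixed $j \in \{1, \dots, d\}$, we write $X := \Omega + [-E, E]^d$, $Y := \pi_j(X)$ and $Z := X \times Y$.
Moreover, we define the $X \times Y$-valued process $\mathcal Z = (Z_n)_{n \ge 0} = (X_n, Y_n)_{n \ge 0}$ on $(\Omega \times \Theta, \mathcal A \otimes \mathcal C, \mu \otimes \nu)$
by $X_n := T^n + \varepsilon_n$ and $Y_n := \pi_j(T^{n+1} + \varepsilon_{n+1})$. In addition, we write $P := (\mu \otimes \nu)_{(X_0, Y_0)}$. Obviously,
if the stochastic process $\mathcal T$ is $\mathcal C$-mixing and the noise process $\varepsilon$ is i.i.d., then the stochastic process

\[ \mathcal Z = (X_n, Y_n)_{n \ge 0} = \bigl( T^n(S_0) + \varepsilon_n,\ \pi_j(T^{n+1}(S_0) + \varepsilon_{n+1}) \bigr)_{n \ge 0} \]

is $\mathcal C$-mixing as well.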

To formulate the oracle inequality for our original $d$-dimensional problem, we need to introduce the
following concepts. Firstly, for the decision function $f : \mathbb R^d \to \mathbb R^d$, it is necessary to introduce a loss
function $\bar L : \mathbb R^d \to [0, \infty)$ such that

\[ \bar L(X_i - f(X_{i-1})) = \bar L\bigl( T^i(S_0) + \varepsilon_i - f(T^{i-1}(S_0) + \varepsilon_{i-1}) \bigr) \]

gives a value for the discrepancy between the forecast $f(T^{i-1}(S_0) + \varepsilon_{i-1})$ and the observation of the
next state $T^i(S_0) + \varepsilon_i$. We say that a loss $\bar L : \mathbb R^d \to [0, \infty)$ can be clipped at $M > 0$, if, for all
$t = (t_1, \dots, t_d) \in \mathbb R^d$, we have $\bar L(\wideparen{t}) \le \bar L(t)$, where $\wideparen{t} = (\wideparen{t}_1, \dots, \wideparen{t}_d)$ denotes the clipped value of $t$ at
$\{\pm M\}^d$. Moreover, the loss function $\bar L : \mathbb R^d \to [0, \infty)$ is called separable, if there exists a distance-based
loss $L : X \times Y \times \mathbb R \to [0, \infty)$ such that its representing function $\psi : \mathbb R \to [0, \infty)$ has a unique
global minimum at $0$ and satisfies

\[ \bar L(r) = \psi(r_1) + \dots + \psi(r_d), \qquad r = (r_1, \dots, r_d) \in \mathbb R^d. \tag{43} \]

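In our problem setting, the average forecasting performance is given by the $\bar L$-risk

\[ \mathcal R_{\bar L,P}(f) := \int\!\!\int \bar L\bigl( T(x) + \varepsilon_1 - f(x + \varepsilon_0) \bigr)\, \nu(d\varepsilon)\, \mu(dx), \tag{44} \]

where $\varepsilon = (\varepsilon_i)_{i \ge 0}$ and $P := \nu \otimes \mu$. Naturally, the smaller the risk, the better the forecaster is. Hence,
we ideally would like to have a forecaster $f^*_{\bar L,P} : \mathbb R^d \to \mathbb R^d$ that attains the minimal $\bar L$-risk

\[ \mathcal R^*_{\bar L,P} := \inf\bigl\{ \mathcal R_{\bar L,P}(f) \mid f : \mathbb R^d \to \mathbb R^d \text{ measurable} \bigr\}. \tag{45} \]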

The assumption (l43l) then implies IZL.p(f) = J2j =i T^lpUd 0)) an ^ 

d 

Kl.vMdJ = Y. n L.n(p( f D(p^ 
i=i 

where D„. D„ are the empirical measures associated to D n , D ri respectively. 

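Finally, let $\bar L : \mathbb R^d \to [0, \infty)$ be a clippable loss and $\mathcal F$ be a hypothesis set with $0 \in \mathcal F$. A regularizer
$\bar\Upsilon$ on $\mathcal F^d$, that is, a function $\bar\Upsilon : \mathcal F^d \to [0, \infty)$, is also said to be separable, if there exists a regularizer $\Upsilon$
on $\mathcal F$ with $\Upsilon(0) = 0$ such that $\bar\Upsilon(f) = \sum_{i=1}^d \Upsilon(f_i)$ for $f = (f_1, \dots, f_d)$. Then, for $\delta \ge 0$, a learning
method whose decision functions $\bar f_{D_n,\bar\Upsilon} \in \mathcal F^d$ satisfy

\[ \bar\Upsilon(\bar f_{D_n,\bar\Upsilon}) + \mathcal R_{\bar L,D_n}(\wideparen{\bar f}_{D_n,\bar\Upsilon}) \le \inf_{f \in \mathcal F^d} \bigl( \bar\Upsilon(f) + \mathcal R_{\bar L,D_n}(f) \bigr) + d\delta \tag{46} \]

for all $n \ge 1$ and $D_n \in (X \times Y)^n$ is called $d\delta$-approximate clipped regularized empirical risk minimization
($d\delta$-CR-ERM) with respect to $\bar L$, $\mathcal F^d$, and $\bar\Upsilon$.

With all these preparations, the oracle inequality for geometrically $\mathcal C$-mixing dynamical systems
with i.i.d. noise processes can be stated as follows: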

Theorem 4.10. Let $\Omega \subset \mathbb R^d$ be compact and $(\Omega, \mathcal A, \mu, T)$ be a dynamical system. Suppose that the
stationary stochastic process $\mathcal T := (T^n)_{n \ge 0}$ is geometrically time-reversed $\mathcal C$-mixing and $\varepsilon = (\varepsilon_n)_{n \ge 0}$
is some i.i.d. noise process defined on $(\Theta, \mathcal C, \nu)$ which is independent of $\mathcal T$. Furthermore, let $\bar L : \mathbb R^d \to
[0, \infty)$ be a clippable and separable loss function with the corresponding loss function $L : X \times Y \times \mathbb R \to
[0, \infty)$ satisfying the properties described in Theorem 4.6. Finally, let $\bar\Upsilon : \mathcal F^d \to [0, \infty)$ be a
separable regularizer. Then, for all fixed $\bar f_0 = (f_0, \dots, f_0)$, $\epsilon > 0$, $\delta \ge 0$, $\tau \ge 1$, $n \ge n_0$ as in Theorem
4.6 and $r \in (0, 1]$ satisfying (33), every learning method defined by (46) satisfies with probability $\mu \otimes \nu$
not less than $1 - 16e^{-\tau}$:

\[ \bar\Upsilon(\bar f_{D_n,\bar\Upsilon}) + \mathcal R_{\bar L,P}(\wideparen{\bar f}_{D_n,\bar\Upsilon}) - \mathcal R^*_{\bar L,P} < 2\bar\Upsilon(\bar f_0) + 4\mathcal R_{\bar L,P}(\bar f_0) - 4\mathcal R^*_{\bar L,P} + 4dr + 5d\epsilon + 2d\delta. \tag{47} \]

Again, this general oracle inequality can be applied to SVMs. We omit the details for the sake of 
brevity and only mention that such applications would lead to learning rates and not only consistency as 
in ED. 
5 Proofs


5.1 Proofs of Section |2] 

Proof of Example 2.2. Consider the collection $\Pi$ of ordered $(n+1)$-tuples of points $z_0 < z_1 < \dots < z_n \in
Z$, where $n$ is an arbitrary natural number. The total variation of a function $f : Z \to \mathbb R$ is given by

\[ \|f\|_{BV(Z)} := \sup_{(z_0, z_1, \dots, z_n) \in \Pi} \sum_{i=1}^n |f(z_i) - f(z_{i-1})|. \]

Let us now assume that we have an $1 \le i \le n$ with $f(z_{i-1}) \le f(z_i)$. Moreover, for $t \le 0$, it is not
difficult to verify that $|1 - e^t| \le |t|$. This implies

\[ \bigl| e^{f(z_i)} - e^{f(z_{i-1})} \bigr| = e^{f(z_i)} \bigl| 1 - e^{f(z_{i-1}) - f(z_i)} \bigr| \le \|e^f\|_\infty\, |f(z_i) - f(z_{i-1})|. \]

By interchanging the roles of $f(z_i)$ and $f(z_{i-1})$ we find the same estimate in the case of $f(z_{i-1}) \ge f(z_i)$.
Consequently we obtain

\[ \sum_{i=1}^n \bigl| e^{f(z_i)} - e^{f(z_{i-1})} \bigr| \le \|e^f\|_\infty \sum_{i=1}^n |f(z_i) - f(z_{i-1})| \]

for all $(z_0, \dots, z_n) \in \Pi$. Taking the supremum we get $\|e^f\|_{BV} \le \|e^f\|_\infty \|f\|_{BV}$, i.e. the required condition on the semi-norm is satisfied. □
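The two inequalities used in this proof are easy to check numerically. The following Python sketch does so on one fixed grid of points and one sample function; both are arbitrary choices made for illustration, and a finite grid only probes a single element of $\Pi$ rather than the supremum.

```python
import math


def variation(values):
    """Sum of absolute increments along an ordered list of function values."""
    return sum(abs(b - a) for a, b in zip(values, values[1:]))


# Ordered points z_0 < z_1 < ... < z_n in [0, 2] and a sample function f.
points = [i / 50 for i in range(101)]
f_vals = [math.sin(3.0 * z) - 0.5 * z for z in points]
exp_vals = [math.exp(v) for v in f_vals]

# Left- and right-hand side of the variation estimate for this partition.
lhs = variation(exp_vals)
rhs = max(exp_vals) * variation(f_vals)

# The elementary inequality |1 - e^t| <= |t| on a grid of t in [-5, 0].
elementary_ok = all(abs(1.0 - math.exp(-i / 100)) <= i / 100 for i in range(501))
```

Here `max(exp_vals)` is taken over the grid only, which is at most the supremum norm of $e^f$, so the check is slightly stronger than the estimate in the proof.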


Proof of Example 2.3. Given a function $f \in C_{b,\alpha}(Z)$, we assume that $f(z) \ge f(z')$. Again, by using
$|1 - e^t| \le |t|$, $t \le 0$, we obtain

\[ \bigl| e^{f(z)} - e^{f(z')} \bigr| = e^{f(z)} \bigl| 1 - e^{f(z') - f(z)} \bigr| \le \|e^f\|_\infty\, |f(z') - f(z)| \le \|e^f\|_\infty\, |f|_\alpha\, |z - z'|^\alpha. \]

By interchanging the roles of $f(z)$ and $f(z')$ we find the same estimate in the case of $f(z') \ge f(z)$.
Consequently we obtain $\|e^f\| \le \|e^f\|_\infty |f|_\alpha$, i.e. the required condition on the semi-norm is satisfied. □

Proof of Theorem IZdl (=>) The proof is straightforward. 

($\Leftarrow$) For $p, q \in [1, \infty]$ with $1/p + 1/q = 1$, let $\mathcal E_1$ and $\mathcal E_2$ be Banach spaces that are continuously embedded
into $L_p(\mu)$ and $L_q(\mu)$, respectively, and let $\mathcal F$ be a Banach space that is continuously embedded
into $\ell_\infty$. Analysis similar to that in the proof of [41, Theorem 5.1] shows that if, for all $n \ge 0$, and all
$h \in \mathcal E_1$, $g \in \mathcal E_2$, the correlation sequence satisfies

\[ \mathrm{cor}_{T,n}(h, g) \in \mathcal F, \]

then there exists a constant $c \in [0, \infty)$ such that

\[ \| \mathrm{cor}_{T,n}(h, g) \|_{\mathcal F} \le c \cdot \|h\|_{\mathcal E_1} \|g\|_{\mathcal E_2}, \qquad h \in \mathcal E_1,\ g \in \mathcal E_2. \tag{48} \]

In particular, (48) holds for $\mathcal E_1 = \mathcal C(\Omega)$ and $\mathcal E_2 = L_1(\mu)$ and the assertion is proved. □


5.2 Proofs of Section |3| 

The following lemma, which may be of independent interest, supplies the key to the proof of Theorem
3.1.

Lemma 5.1. Let $\mathcal Z := (Z_n)_{n \ge 0}$ be a $Z$-valued stationary (time-reversed) $\mathcal C$-mixing process on the
probability space $(\Omega, \mathcal A, \mu)$ with rate $(d_n)_{n \ge 0}$, and $P := \mu_{Z_0}$. Moreover, for $f : Z \to [0, \infty)$, suppose
that $f \in \mathcal C(Z)$ and write $f_n := f \circ Z_n$. Finally, assume that we have natural numbers $k$ and $l$ satisfying

\[ 2l \cdot \|f\|_{\mathcal C} \cdot d_k \le \|f\|_{L_1(P)}. \tag{49} \]

Then we have

\[ \mathbb E_\mu \prod_{j=0}^{l} f_{jk} \le 2 \|f\|_{L_1(P)}^{l+1}. \]


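Proof of Lemma 5.1. We divide the proof into two parts.

(i) Suppose that the correlation inequality for $\mathcal C$-mixing processes holds. Obviously the case $f = 0$ $P$-a.s. is trivial. For
$f \ne 0$, we define

\[ D_l := \Bigl| \mathbb E_\mu \prod_{j=0}^{l} f_{jk} - \prod_{j=0}^{l} \mathbb E_\mu f_{jk} \Bigr|. \tag{50} \]

Then we have

\[ D_l \le \Bigl| \mathbb E_\mu \prod_{j=0}^{l} f_{jk} - \mathbb E_\mu \prod_{j=0}^{l-1} f_{jk}\, \mathbb E_\mu f_{lk} \Bigr| + \Bigl| \mathbb E_\mu \prod_{j=0}^{l-1} f_{jk}\, \mathbb E_\mu f_{lk} - \prod_{j=0}^{l} \mathbb E_\mu f_{jk} \Bigr| \]
\[ \phantom{D_l} = \Bigl| \mathbb E_\mu \Bigl( \prod_{j=0}^{l-1} f_{jk} \Bigr) f_{lk} - \mathbb E_\mu \prod_{j=0}^{l-1} f_{jk}\, \mathbb E_\mu f_{lk} \Bigr| + \Bigl| \mathbb E_\mu \prod_{j=0}^{l-1} f_{jk}\, \mathbb E_\mu f_{lk} - \prod_{j=0}^{l-1} \mathbb E_\mu f_{jk}\, \mathbb E_\mu f_{lk} \Bigr|. \]

Since the stochastic process $\mathcal Z$ is stationary, the decay of correlations together with $\psi := \prod_{j=0}^{l-1} f_{jk}$,
$h := f$, and the assumption $f \ge 0$ yields

\[ \Bigl| \mathbb E_\mu \Bigl( \prod_{j=0}^{l-1} f_{jk} \Bigr) f_{lk} - \mathbb E_\mu \prod_{j=0}^{l-1} f_{jk}\, \mathbb E_\mu f_{lk} \Bigr| \le \Bigl\| \prod_{j=0}^{l-1} f_{jk} \Bigr\|_{L_1(\mu)} \|f\|_{\mathcal C}\, d_k = \mathbb E_\mu \prod_{j=0}^{l-1} f_{jk}\, \|f\|_{\mathcal C}\, d_k \]
\[ \le \Bigl( \Bigl| \mathbb E_\mu \prod_{j=0}^{l-1} f_{jk} - \prod_{j=0}^{l-1} \mathbb E_\mu f_{jk} \Bigr| + \prod_{j=0}^{l-1} \mathbb E_\mu f_{jk} \Bigr) \|f\|_{\mathcal C}\, d_k = \bigl( D_{l-1} + \|f\|_{L_1(P)}^{l} \bigr) \|f\|_{\mathcal C}\, d_k. \]

Moreover, for the second term, we find

\[ \Bigl| \mathbb E_\mu \prod_{j=0}^{l-1} f_{jk}\, \mathbb E_\mu f_{lk} - \prod_{j=0}^{l-1} \mathbb E_\mu f_{jk}\, \mathbb E_\mu f_{lk} \Bigr| = \|f\|_{L_1(P)} \Bigl| \mathbb E_\mu \prod_{j=0}^{l-1} f_{jk} - \prod_{j=0}^{l-1} \mathbb E_\mu f_{jk} \Bigr| = \|f\|_{L_1(P)}\, D_{l-1}. \]

These estimates together imply that

\[ D_l \le \bigl( D_{l-1} + \|f\|_{L_1(P)}^{l} \bigr) \|f\|_{\mathcal C}\, d_k + \|f\|_{L_1(P)}\, D_{l-1} = \bigl( \|f\|_{L_1(P)} + \|f\|_{\mathcal C}\, d_k \bigr) D_{l-1} + \|f\|_{\mathcal C} \|f\|_{L_1(P)}^{l}\, d_k. \tag{51} \]

In the following, we will show by induction that the latter estimate implies

\[ D_l \le \|f\|_{L_1(P)} \Bigl( \bigl( \|f\|_{L_1(P)} + \|f\|_{\mathcal C}\, d_k \bigr)^{l} - \|f\|_{L_1(P)}^{l} \Bigr). \tag{52} \]

When $l = 1$, (52) is true because of the correlation inequality. Now let $l \ge 1$ be given and suppose (52) is true for $l$. Then (51)
and (52) imply

\[ D_{l+1} \le \bigl( \|f\|_{L_1(P)} + \|f\|_{\mathcal C}\, d_k \bigr) D_l + \|f\|_{\mathcal C} \|f\|_{L_1(P)}^{l+1}\, d_k \]
\[ \le \bigl( \|f\|_{L_1(P)} + \|f\|_{\mathcal C}\, d_k \bigr) \Bigl( \|f\|_{L_1(P)} \Bigl( \bigl( \|f\|_{L_1(P)} + \|f\|_{\mathcal C}\, d_k \bigr)^{l} - \|f\|_{L_1(P)}^{l} \Bigr) \Bigr) + \|f\|_{\mathcal C} \|f\|_{L_1(P)}^{l+1}\, d_k \]
\[ = \|f\|_{L_1(P)} \Bigl( \bigl( \|f\|_{L_1(P)} + \|f\|_{\mathcal C}\, d_k \bigr)^{l+1} - \|f\|_{L_1(P)}^{l+1} \Bigr). \]

Thus, (52) holds for $l + 1$, and the proof of the induction step is complete. By the principle of induction,
(52) is thus true for all $l \ge 1$.

Using the binomial formula, we obtain

\[ D_l \le \|f\|_{L_1(P)} \Bigl( \sum_{i=0}^{l} \binom{l}{i} \|f\|_{L_1(P)}^{l-i} \bigl( \|f\|_{\mathcal C}\, d_k \bigr)^{i} - \|f\|_{L_1(P)}^{l} \Bigr). \]

For $i = 0, \dots, l$ we now set

\[ a_i := \binom{l}{i} \|f\|_{L_1(P)}^{l-i} \bigl( \|f\|_{\mathcal C}\, d_k \bigr)^{i}. \]

The assumption (49) implies for $i = 0, \dots, l - 1$

\[ \frac{a_{i+1}}{a_i} = \frac{\binom{l}{i+1} \|f\|_{L_1(P)}^{l-i-1} ( \|f\|_{\mathcal C}\, d_k )^{i+1}}{\binom{l}{i} \|f\|_{L_1(P)}^{l-i} ( \|f\|_{\mathcal C}\, d_k )^{i}} = \frac{l - i}{i + 1} \cdot \frac{\|f\|_{\mathcal C}\, d_k}{\|f\|_{L_1(P)}} \le l \cdot \frac{\|f\|_{\mathcal C}\, d_k}{\|f\|_{L_1(P)}} \le \frac12. \]

This gives $a_i \le 2^{-i} a_0$ for all $i = 0, \dots, l$ and consequently we have

\[ \sum_{i=0}^{l} a_i = a_0 + \sum_{i=1}^{l} a_i \le a_0 + \sum_{i=1}^{l} 2^{-i} a_0 = a_0 \cdot \Bigl( 1 + \sum_{i=1}^{l} 2^{-i} \Bigr) \le 2 a_0. \]

This implies

\[ D_l \le \|f\|_{L_1(P)} \Bigl( \sum_{i=0}^{l} a_i - \|f\|_{L_1(P)}^{l} \Bigr) \le \|f\|_{L_1(P)} \bigl( 2 a_0 - \|f\|_{L_1(P)}^{l} \bigr) = \|f\|_{L_1(P)} \bigl( 2 \|f\|_{L_1(P)}^{l} - \|f\|_{L_1(P)}^{l} \bigr) = \|f\|_{L_1(P)}^{l+1}. \]

Using the definition of $D_l$ we thus obtain

\[ \mathbb E_\mu \prod_{j=0}^{l} f_{jk} \le D_l + \prod_{j=0}^{l} \mathbb E_\mu f_{jk} \le 2 \|f\|_{L_1(P)}^{l+1}. \]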
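The passage from the recursion (51) to the closed form (52), and then to the bound by $\|f\|_{L_1(P)}^{l+1}$ under (49), can be sanity-checked numerically. The Python sketch below treats (51) as an equality started at $D_0 = 0$, which gives the largest values the recursion allows; the abbreviations a for $\|f\|_{L_1(P)}$ and b for $\|f\|_{\mathcal C} d_k$, as well as the chosen numbers, are assumptions for illustration.

```python
def recursion(a, b, l):
    """Iterate D_m = (a + b) * D_{m-1} + b * a**m for m = 1, ..., l, with D_0 = 0."""
    D = 0.0
    for m in range(1, l + 1):
        D = (a + b) * D + b * a ** m
    return D


def closed_form(a, b, l):
    """Right-hand side of (52): a * ((a + b)**l - a**l)."""
    return a * ((a + b) ** l - a ** l)


# a = ||f||_{L_1(P)}, b = ||f||_C * d_k, chosen so that 2 * l * b <= a as in (49).
a, b, l = 1.0, 0.01, 50
worst_case = recursion(a, b, l)
```

With these values the closed form equals $1.01^{50} - 1$, which is below $a^{l+1} = 1$ as the final step of part (i) predicts.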

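(ii) Suppose that the correlation inequality for time-reversed $\mathcal C$-mixing processes holds.
Again, the case $f = 0$ $P$-a.s. is trivial. For $f \ne 0$, we estimate $D_l$ defined as in (50) in a slightly different
way from above:

\[ D_l \le \Bigl| \mathbb E_\mu f_0 \prod_{j=1}^{l} f_{jk} - \mathbb E_\mu f_0\, \mathbb E_\mu \prod_{j=1}^{l} f_{jk} \Bigr| + \Bigl| \mathbb E_\mu f_0\, \mathbb E_\mu \prod_{j=1}^{l} f_{jk} - \prod_{j=0}^{l} \mathbb E_\mu f_{jk} \Bigr| \]
\[ \phantom{D_l} = \Bigl| \mathbb E_\mu f_0 \prod_{j=1}^{l} f_{jk} - \mathbb E_\mu f_0\, \mathbb E_\mu \prod_{j=1}^{l} f_{jk} \Bigr| + \Bigl| \mathbb E_\mu f_0\, \mathbb E_\mu \prod_{j=1}^{l} f_{jk} - \mathbb E_\mu f_0 \prod_{j=1}^{l} \mathbb E_\mu f_{jk} \Bigr|. \]

Since the stochastic process $\mathcal Z$ is stationary, the decay of correlations together with $h := f$, $\varphi :=
\prod_{j=1}^{l} f_{jk}$, and the assumption $f \ge 0$ yields

\[ \Bigl| \mathbb E_\mu f_0 \prod_{j=1}^{l} f_{jk} - \mathbb E_\mu f_0\, \mathbb E_\mu \prod_{j=1}^{l} f_{jk} \Bigr| \le \|f\|_{\mathcal C} \Bigl\| \prod_{j=1}^{l} f_{jk} \Bigr\|_{L_1(\mu)} d_k = \|f\|_{\mathcal C}\, \mathbb E_\mu \prod_{j=1}^{l} f_{jk}\, d_k = \|f\|_{\mathcal C}\, \mathbb E_\mu \prod_{j=0}^{l-1} f_{jk}\, d_k \]
\[ \le \|f\|_{\mathcal C} \Bigl( \Bigl| \mathbb E_\mu \prod_{j=0}^{l-1} f_{jk} - \prod_{j=0}^{l-1} \mathbb E_\mu f_{jk} \Bigr| + \prod_{j=0}^{l-1} \mathbb E_\mu f_{jk} \Bigr) d_k = \|f\|_{\mathcal C} \bigl( D_{l-1} + \|f\|_{L_1(P)}^{l} \bigr) d_k. \]

Moreover, for the second term, since the stochastic process $\mathcal Z$ is stationary, we find

\[ \Bigl| \mathbb E_\mu f_0\, \mathbb E_\mu \prod_{j=1}^{l} f_{jk} - \mathbb E_\mu f_0 \prod_{j=1}^{l} \mathbb E_\mu f_{jk} \Bigr| = \|f\|_{L_1(P)} \Bigl| \mathbb E_\mu \prod_{j=1}^{l} f_{jk} - \prod_{j=1}^{l} \mathbb E_\mu f_{jk} \Bigr| = \|f\|_{L_1(P)} \Bigl| \mathbb E_\mu \prod_{j=0}^{l-1} f_{jk} - \prod_{j=0}^{l-1} \mathbb E_\mu f_{jk} \Bigr| = \|f\|_{L_1(P)}\, D_{l-1}. \]

Combining the above estimates, we get

\[ D_l \le \|f\|_{\mathcal C} \bigl( D_{l-1} + \|f\|_{L_1(P)}^{l} \bigr) d_k + \|f\|_{L_1(P)}\, D_{l-1} = \bigl( \|f\|_{L_1(P)} + \|f\|_{\mathcal C}\, d_k \bigr) D_{l-1} + \|f\|_{\mathcal C} \|f\|_{L_1(P)}^{l}\, d_k. \]

This estimate coincides with (51). The rest of the argument is the same as in (i), and the assertion is
proved. □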

To prove Theorem 3.1 we need to introduce some notation. In the following, for $t \in \mathbb R$, $\lfloor t \rfloor$ is the
largest integer $n$ satisfying $n \le t$, and similarly, $\lceil t \rceil$ is the smallest integer $n$ satisfying $n \ge t$. We write
$h_i := h \circ Z_i$ and

\[ S_n = \sum_{i=1}^{n} h_i = \sum_{i=1}^{n} h \circ Z_i. \]

We now recall the so-called blocking method. To this end, we partition the set $\{1, 2, \dots, n\}$ into $k$
blocks. Each block will contain approximately $l := \lfloor n/k \rfloor$ terms. Let $r := n - k \cdot l < k$ denote the
remainder when we divide $n$ by $k$.

We now construct $k$ blocks as follows. Define $I_i$, the indexes of terms in the $i$-th block, as

\[ I_i = \begin{cases} \{ i, i + k, \dots, i + lk \}, & \text{if } 1 \le i \le r, \\ \{ i, i + k, \dots, i + (l - 1)k \}, & \text{if } r + 1 \le i \le k. \end{cases} \]

Note that the number of the terms satisfies

\[ |I_i| = \begin{cases} l + 1, & \text{for } 1 \le i \le r, \\ l, & \text{for } r + 1 \le i \le k. \end{cases} \]

In other words, the first $r$ blocks each contain $l + 1$ terms, while the last $(k - r)$ blocks each contain $l$
terms. Moreover, we have

\[ \sum_{i=1}^{k} |I_i| = \sum_{i=1}^{r} |I_i| + \sum_{i=r+1}^{k} |I_i| = r(l + 1) + (k - r)l = n. \tag{53} \]
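The blocking construction is short enough to write out in Python. In the sketch below, `make_blocks` is a hypothetical helper name; it returns the quotient l, the remainder r, and the index sets $I_1, \dots, I_k$ as lists, where block i collects the indices i, i + k, i + 2k, ... up to n.

```python
def make_blocks(n, k):
    """Partition {1, ..., n} into k interleaved blocks I_1, ..., I_k.

    Block i contains the indices i, i + k, i + 2k, ..., so the first r blocks
    have l + 1 elements and the remaining k - r blocks have l elements,
    where n = k * l + r with 0 <= r < k.
    """
    l, r = divmod(n, k)
    blocks = []
    for i in range(1, k + 1):
        size = l + 1 if i <= r else l
        blocks.append([i + j * k for j in range(size)])
    return l, r, blocks


# Example: n = 11 observations split into k = 3 blocks.
l, r, blocks = make_blocks(11, 3)
```

Consecutive indices inside a block are k steps apart, which is what allows the mixing rate $d_k$ to enter the estimates for the block sums.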

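Furthermore, for $i = 1, 2, \dots, k$, we define the $i$-th block sum as

\[ g_i = \sum_{j \in I_i} h_j \tag{54} \]

such that

\[ S_n = \sum_{i=1}^{k} g_i. \tag{55} \]

Finally, for $i = 1, 2, \dots, k$, define

\[ p_i := \frac{|I_i|}{n}. \tag{56} \]

It follows from (53) that

\[ \sum_{i=1}^{k} p_i = \sum_{i=1}^{k} \frac{|I_i|}{n} = 1. \]

The following three lemmas will derive the upper bounds for the expected value of the exponentials
of $S_n$.

Lemma 5.2. Let $\mathcal Z := (Z_n)_{n \ge 0}$ be a $Z$-valued stationary stochastic process on the probability space
$(\Omega, \mathcal A, \mu)$ and $P := \mu_{Z_0}$. Moreover, let $k$ and $l$ be defined as above, and for a bounded $h : Z \to \mathbb R$ we
define $g_i$ and $S_n$ by (54) and (55), respectively. Then, for all $t > 0$, we have

\[ \mathbb E_\mu \exp\Bigl( t \frac{S_n}{n} \Bigr) \le \sum_{i=1}^{k} p_i\, \mathbb E_\mu \exp\Bigl( t \frac{g_i}{|I_i|} \Bigr). \]

Proof of Lemma 5.2. It is well-known that the exponential function is convex. Jensen's inequality together
with $\sum_{i=1}^{k} p_i = 1$, (55), and (56) yields

\[ \mathbb E_\mu \exp\Bigl( t \frac{S_n}{n} \Bigr) = \mathbb E_\mu \exp\Bigl( t \sum_{i=1}^{k} p_i \frac{g_i}{|I_i|} \Bigr) \le \mathbb E_\mu \sum_{i=1}^{k} p_i \exp\Bigl( t \frac{g_i}{|I_i|} \Bigr) = \sum_{i=1}^{k} p_i\, \mathbb E_\mu \exp\Bigl( t \frac{g_i}{|I_i|} \Bigr). \]

□

Lemma 5.3. Let $\mathcal Z := (Z_n)_{n \ge 0}$ be a $Z$-valued stationary (time-reversed) $\mathcal C$-mixing process on the
probability space $(\Omega, \mathcal A, \mu)$ with rate $(d_n)_{n \ge 0}$, and $P := \mu_{Z_0}$. Moreover, for $h : Z \to [0, \infty)$, we write
$h_n := h \circ Z_n$. Finally, let $k$ and $l$ be defined as above. Then, for all $t > 0$ satisfying

\[ e^{\frac{t}{|I_i|} h} \in \mathcal C(Z) \quad \text{and} \quad 2l \cdot \bigl\| e^{\frac{t}{|I_i|} h} \bigr\|_{\mathcal C} \cdot d_k \le \bigl\| e^{\frac{t}{|I_i|} h} \bigr\|_{L_1(P)}, \tag{57} \]

we have

\[ \mathbb E_\mu \exp\Bigl( t \frac{g_i}{|I_i|} \Bigr) \le 2 \Bigl( \mathbb E_P \exp\Bigl( \frac{t}{|I_i|} h \Bigr) \Bigr)^{|I_i|}. \]

Proof of Lemma 5.3. The $i$-th block sum $g_i$ in (54) depends only on $h_{i + jk}$ with $j$ ranging from $0$ through
$|I_i| - 1$. Since $\mathcal Z$ is stationary, Lemma 5.1 with $f := \exp(\frac{t}{|I_i|} h)$ then yields

\[ \mathbb E_\mu \exp\Bigl( t \frac{g_i}{|I_i|} \Bigr) = \mathbb E_\mu \exp\Bigl( \frac{t}{|I_i|} \sum_{j=0}^{|I_i| - 1} h_{i + jk} \Bigr) = \mathbb E_\mu \exp\Bigl( \frac{t}{|I_i|} \sum_{j=0}^{|I_i| - 1} h_{jk} \Bigr) = \mathbb E_\mu \prod_{j=0}^{|I_i| - 1} f_{jk} \le 2 \Bigl( \mathbb E_P \exp\Bigl( \frac{t}{|I_i|} h \Bigr) \Bigr)^{|I_i|}. \]

□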
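Lemma 5.4. Let $\mathcal Z := (Z_n)_{n \ge 0}$ be a $Z$-valued stationary (time-reversed) $\mathcal C$-mixing process on the
probability space $(\Omega, \mathcal A, \mu)$ with rate $(d_n)_{n \ge 0}$, and $P := \mu_{Z_0}$. Moreover, for $h : Z \to [0, \infty)$, we write
$h_n := h \circ Z_n$ and suppose that $\mathbb E_P h = 0$, $\|h\| \le A$, $\|h\|_\infty \le B$, and $\mathbb E_P h^2 \le \sigma^2$ for some $A > 0$,
$B > 0$ and $\sigma \ge 0$. Finally, let $k$ and $l$ be defined as above. Then, for all $i = 1, \dots, k$, and all $t > 0$
satisfying $0 < t < 3l/B$ and (57), we have

\[ \mathbb E_\mu \exp\Bigl( t \frac{g_i}{|I_i|} \Bigr) \le 2 \exp\Bigl( \frac{t^2 \sigma^2}{2(l - tB/3)} \Bigr). \]

Proof of Lemma 5.4. Because of $\|h\|_\infty \le B$ and $2 \cdot 3^{j-2} \le j!$, we obtain

\[ \exp\Bigl( \frac{t}{|I_i|} h \Bigr) = \sum_{j=0}^{\infty} \frac{1}{j!} \Bigl( \frac{t}{|I_i|} h \Bigr)^{j} \le 1 + \frac{t}{|I_i|} h + \sum_{j=2}^{\infty} \Bigl( \frac{t}{|I_i|} \Bigr)^{j} \frac{h^2 B^{j-2}}{2 \cdot 3^{j-2}} \]
\[ = 1 + \frac{t}{|I_i|} h + \frac12 \Bigl( \frac{t}{|I_i|} \Bigr)^{2} h^2 \sum_{j=2}^{\infty} \Bigl( \frac{tB}{3|I_i|} \Bigr)^{j-2} = 1 + \frac{t}{|I_i|} h + \frac12 \Bigl( \frac{t}{|I_i|} \Bigr)^{2} h^2 \cdot \frac{1}{1 - tB/(3|I_i|)} \]

if $tB/(3|I_i|) < 1$. This, together with $\mathbb E_P h = 0$, $1 + x \le e^x$, and $l \le |I_i| \le l + 1$, implies

\[ \Bigl( \mathbb E_P \exp\Bigl( \frac{t}{|I_i|} h \Bigr) \Bigr)^{|I_i|} \le \Bigl( 1 + \frac12 \Bigl( \frac{t}{|I_i|} \Bigr)^{2} \frac{\sigma^2}{1 - tB/(3|I_i|)} \Bigr)^{|I_i|} \le \exp\Bigl( \frac12 \Bigl( \frac{t}{|I_i|} \Bigr)^{2} \frac{\sigma^2}{1 - tB/(3|I_i|)} \Bigr)^{|I_i|} \]
\[ = \exp\Bigl( \frac{t^2 \sigma^2}{2(|I_i| - tB/3)} \Bigr) \le \exp\Bigl( \frac{t^2 \sigma^2}{2(l - tB/3)} \Bigr), \tag{58} \]

since the assumed $tB/(3l) < 1$ implies $tB/(3|I_i|) < 1$. Lemma 5.3 then yields

\[ \mathbb E_\mu \exp\Bigl( t \frac{g_i}{|I_i|} \Bigr) \le 2 \exp\Bigl( \frac{t^2 \sigma^2}{2(l - tB/3)} \Bigr). \]

□

Proof of Theorem 3.1. For $k$ and $l$ as above we define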


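\[ t := \frac{l \varepsilon}{\sigma^2 + \varepsilon B/3}. \tag{59} \]

Then we have

\[ \frac{t}{|I_i|} \le \frac{t}{l} = \frac{\varepsilon}{\sigma^2 + \varepsilon B/3} < \frac{\varepsilon}{\varepsilon B/3} = \frac{3}{B}. \tag{60} \]

In particular, this $t$ satisfies $0 < t < 3l/B$. Moreover, we find

\[ \Bigl\| \exp\Bigl( \frac{t}{|I_i|} h \Bigr) \Bigr\|_\infty \le \exp\Bigl( \frac{3}{B} \cdot B \Bigr) = e^3. \tag{61} \]

Then, the assumption (5) together with the bounds (61) and (60) implies

\[ \Bigl\| \exp\Bigl( \frac{t}{|I_i|} h \Bigr) \Bigr\| \le \Bigl\| \exp\Bigl( \frac{t}{|I_i|} h \Bigr) \Bigr\|_\infty \Bigl\| \frac{t}{|I_i|} h \Bigr\| \le e^3 \cdot \frac{t}{|I_i|} \cdot A \le \frac{3 e^3 A}{B}. \tag{62} \]

Since $-B \le h \le B$, we further find

\[ \Bigl\| \exp\Bigl( \frac{t}{|I_i|} h \Bigr) \Bigr\|_{L_1(P)} = \mathbb E_P \exp\Bigl( \frac{t}{|I_i|} h \Bigr) \ge \exp\Bigl( \frac{3}{B} \cdot (-B) \Bigr) = e^{-3}. \tag{63} \]

Now we choose $k := \lfloor (\log n)^{2/\gamma} \rfloor + 1$, which implies $k \ge (\log n)^{2/\gamma}$. On the other hand, since $(\log n)^{2/\gamma} \ge 1$
for $n \ge n_0 \ge 3$, we have $k \le 2 (\log n)^{2/\gamma}$. This implies

\[ l = \frac{n - r}{k} \ge \frac{n}{k} - 1 \ge \frac{n}{2 (\log n)^{2/\gamma}} - 1 \ge \frac{n}{4 (\log n)^{2/\gamma}}, \tag{64} \]

since we have $n \ge 4 (\log n)^{2/\gamma}$ for $n \ge n_0$. Now, by (61), (62), (63), the definition of $\|\cdot\|_{\mathcal C}$, and (13) we obtain

\[ l \cdot \frac{\bigl\| e^{\frac{t}{|I_i|} h} \bigr\|_{\mathcal C}}{\bigl\| e^{\frac{t}{|I_i|} h} \bigr\|_{L_1(P)}} \cdot d_k \le l \cdot \frac{\bigl\| e^{\frac{t}{|I_i|} h} \bigr\|_\infty + \bigl\| e^{\frac{t}{|I_i|} h} \bigr\|}{\bigl\| e^{\frac{t}{|I_i|} h} \bigr\|_{L_1(P)}} \cdot c \cdot \exp(-b k^\gamma) \]
\[ \le n \cdot \frac{e^3 + \frac{3 e^3 A}{B}}{e^{-3}} \cdot c \cdot \exp\bigl( -b (\log n)^2 \bigr) \le n \cdot \frac{404 c (3A + B)}{B} \cdot \exp(-b \log n \cdot \log n) \le n \cdot \frac{n^2}{2} \cdot n^{-3} = \frac12, \]

i.e., the assumption (57) is valid.

Summarizing, the value of $t$ defined as in (59) satisfies $0 < t < 3l/B$ and the assumption (57). In
other words, all the requirements on $t$ in Lemma 5.4 are satisfied.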

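Now, for this $t$, by using Markov's inequality, Lemma 5.2 and Lemma 5.4 we obtain for any $\varepsilon > 0$,

\[ \mu\Bigl( \frac{S_n}{n} \ge \varepsilon \Bigr) = \mu\Bigl( \exp\Bigl( t \frac{S_n}{n} \Bigr) \ge \exp(t \varepsilon) \Bigr) \le \exp(-t \varepsilon)\, \mathbb E_\mu \exp\Bigl( t \frac{S_n}{n} \Bigr) \]
\[ \le \exp(-t \varepsilon) \sum_{i=1}^{k} p_i\, \mathbb E_\mu \exp\Bigl( t \frac{g_i}{|I_i|} \Bigr) \le \exp(-t \varepsilon) \cdot 2 \exp\Bigl( \frac{t^2 \sigma^2}{2(l - tB/3)} \Bigr) \sum_{i=1}^{k} p_i = 2 \exp\Bigl( -t \varepsilon + \frac{t^2 \sigma^2}{2(l - tB/3)} \Bigr). \tag{65} \]

Substituting the definition of $t$ into the exponent of inequality (65), we get

\[ -t \varepsilon + \frac{t^2 \sigma^2}{2(l - tB/3)} = -\frac{l \varepsilon^2}{\sigma^2 + \varepsilon B/3} + \frac{l^2 \varepsilon^2 \sigma^2}{(\sigma^2 + \varepsilon B/3)^2 \cdot 2 \bigl( l - \frac{l \varepsilon B/3}{\sigma^2 + \varepsilon B/3} \bigr)} \]
\[ = -\frac{l \varepsilon^2}{\sigma^2 + \varepsilon B/3} + \frac{l \varepsilon^2 \sigma^2}{(\sigma^2 + \varepsilon B/3) \cdot 2 (\sigma^2 + \varepsilon B/3 - \varepsilon B/3)} = -\frac{l \varepsilon^2}{2 (\sigma^2 + \varepsilon B/3)}, \]

hence

\[ \mu\Bigl( \frac1n S_n \ge \varepsilon \Bigr) \le 2 \exp\Bigl( -\frac{l \varepsilon^2}{2 (\sigma^2 + \varepsilon B/3)} \Bigr). \]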
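The algebra behind the substitution of (59) into the exponent of (65) can be verified numerically. The Python sketch below compares the exponent evaluated at the chosen t with the closed form obtained above; the function names and the sample values of $\varepsilon$, l, $\sigma^2$ and B are assumptions made for illustration.

```python
def optimal_t(eps, l, sigma2, B):
    """The choice (59): t = l * eps / (sigma^2 + eps * B / 3)."""
    return l * eps / (sigma2 + eps * B / 3.0)


def exponent(t, eps, l, sigma2, B):
    """Exponent in (65): -t * eps + t^2 * sigma^2 / (2 * (l - t * B / 3))."""
    return -t * eps + t * t * sigma2 / (2.0 * (l - t * B / 3.0))


def closed_form(eps, l, sigma2, B):
    """Simplified exponent: -l * eps^2 / (2 * (sigma^2 + eps * B / 3))."""
    return -l * eps ** 2 / (2.0 * (sigma2 + eps * B / 3.0))


# Sample values (assumptions): eps = 0.1, l = 50, sigma^2 = 0.25, B = 1.
eps, l, sigma2, B = 0.1, 50, 0.25, 1.0
t = optimal_t(eps, l, sigma2, B)
gap = abs(exponent(t, eps, l, sigma2, B) - closed_form(eps, l, sigma2, B))
```

The check also confirms that this t lies in the admissible range $0 < t < 3l/B$ required by Lemma 5.4.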


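Using the estimate (64), we thus obtain

\[ \mu\Bigl( \frac1n S_n \ge \varepsilon \Bigr) \le 2 \exp\Bigl( -\frac{n \varepsilon^2}{8 (\log n)^{2/\gamma} (\sigma^2 + \varepsilon B/3)} \Bigr) \]

for all $n \ge n_0$ and $\varepsilon > 0$. Setting $\tau := \frac{n \varepsilon^2}{8 (\log n)^{2/\gamma} (\sigma^2 + \varepsilon B/3)}$, we then have

\[ \mu\Bigl( \Bigl\{ \omega \in \Omega : \frac1n \sum_{i=1}^{n} h(Z_i(\omega)) \ge \varepsilon \Bigr\} \Bigr) \le 2 e^{-\tau}, \qquad n \ge n_0. \]

Simple transformations and estimations then yield

\[ \mu\Bigl( \Bigl\{ \omega \in \Omega : \frac1n \sum_{i=1}^{n} h(Z_i(\omega)) \ge \sqrt{\frac{8 (\log n)^{2/\gamma} \tau \sigma^2}{n}} + \frac{8 (\log n)^{2/\gamma} B \tau}{3n} \Bigr\} \Bigr) \le 2 e^{-\tau} \]

for all $n \ge n_0$ and $\tau > 0$.

□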
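The two forms of the Bernstein-type bound obtained at the end of this proof are related by solving a quadratic inequality in $\varepsilon$. The Python sketch below evaluates both; the function names are hypothetical, and the requirement $n \ge n_0$ is not checked, so the numbers only illustrate the algebra and not the validity of the bound for a concrete process.

```python
import math


def bernstein_tail(n, eps, sigma2, B, gamma):
    """Right-hand side 2 * exp(-n eps^2 / (8 (log n)^(2/gamma) (sigma^2 + eps B / 3)))."""
    log_factor = math.log(n) ** (2.0 / gamma)
    return 2.0 * math.exp(-n * eps ** 2 / (8.0 * log_factor * (sigma2 + eps * B / 3.0)))


def deviation(n, tau, sigma2, B, gamma):
    """Deviation level sqrt(8 (log n)^(2/gamma) tau sigma^2 / n) + 8 (log n)^(2/gamma) B tau / (3 n)."""
    log_factor = math.log(n) ** (2.0 / gamma)
    return (math.sqrt(8.0 * log_factor * tau * sigma2 / n)
            + 8.0 * log_factor * B * tau / (3.0 * n))


# Sample values (assumptions): n = 10000, tau = 3, sigma^2 = 0.25, B = 1, gamma = 1.
n, tau, sigma2, B, gamma = 10000, 3.0, 0.25, 1.0, 1.0
eps_tau = deviation(n, tau, sigma2, B, gamma)
tail_at_eps = bernstein_tail(n, eps_tau, sigma2, B, gamma)
```

Compared with the classical i.i.d. Bernstein inequality, the only change visible in these formulas is the factor $(\log n)^{2/\gamma}$ together with larger constants.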
5.3 Proofs of Section 4

Proof of Lemma 4.7. (i) For the least squares loss (23), by using $a + b \le (2(a^2 + b^2))^{1/2}$, we obtain

\[ |L(x, y, f(x)) - L(x', y', f(x'))| = |(y - f(x))^2 - (y' - f(x'))^2| \]
\[ = |y - f(x) + y' - f(x')| \cdot |y - f(x) - y' + f(x')| \]
\[ \le \bigl( |y + y'| + |f(x) + f(x')| \bigr) \bigl( |y - y'| + |f(x) - f(x')| \bigr) \]
\[ \le 2 (M + \|f\|_\infty) \bigl( |y - y'| + |f|_1 |x - x'| \bigr) \]
\[ \le 2 (M + \|f\|_\infty)(1 + |f|_1) \bigl( |y - y'| + |x - x'| \bigr) \]
\[ \le 2\sqrt{2}\, (M + \|f\|_\infty)(1 + |f|_1)\, \|(x, y) - (x', y')\|_2 \]

for all $(x, y), (x', y') \in X \times Y$, that is, we have proved the assertion.

(ii) Let $L$ be the $\tau$-pinball loss (24) and define

\[ D := L(x, y, f(x)) - L(x', y', f(x')). \]

We divide the proof into the following four cases. If $y \ge f(x)$ and $y' \ge f(x')$, we have

\[ |D| = |\tau (y - f(x)) - \tau (y' - f(x'))| = \tau\, |(y - y') - (f(x) - f(x'))|. \]

If $y < f(x)$ and $y' < f(x')$, in exactly the same way we obtain

\[ |D| = (1 - \tau)\, |(y - y') - (f(x) - f(x'))|. \]

Moreover, in the case of $y \ge f(x)$ and $y' < f(x')$, we get

\[ |D| = |\tau (y - f(x)) + (1 - \tau)(y' - f(x'))| \le |(y - f(x)) + (f(x') - y')|. \]

Similar arguments for the case $y < f(x)$ and $y' \ge f(x')$ show that

\[ |D| = |-(1 - \tau)(y - f(x)) - \tau (y' - f(x'))| \le |(y - f(x)) + (f(x') - y')|. \]

Summarizing, for all $(x, y), (x', y') \in X \times Y$, we have

\[ |L(x, y, f(x)) - L(x', y', f(x'))| \le |(y - y') - (f(x) - f(x'))| \le |y - y'| + |f(x) - f(x')|. \]

The rest of the argument is similar to that of part (i), and the assertion is proved. □
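The key estimate of part (ii), namely that the pinball loss changes by at most $|y - y'| + |t - t'|$ when label and prediction change, can be checked by brute force on a grid. In the Python sketch below, `pinball` and `lipschitz_ok` are hypothetical helper names, t stands for the prediction f(x), and the grid and tolerance are arbitrary choices.

```python
def pinball(y, t, tau):
    """tau-pinball loss: tau * (y - t) if y >= t, else (1 - tau) * (t - y)."""
    r = y - t
    return tau * r if r >= 0 else (tau - 1.0) * r


# Grid of labels and predictions in [-1, 1].
grid = [i / 4 - 1.0 for i in range(9)]


def lipschitz_ok(tau, tol=1e-12):
    """Check |L(y, t) - L(y', t')| <= |y - y'| + |t - t'| over the whole grid."""
    return all(
        abs(pinball(y, t, tau) - pinball(yp, tp, tau)) <= abs(y - yp) + abs(t - tp) + tol
        for y in grid for t in grid for yp in grid for tp in grid
    )


results = {tau: lipschitz_ok(tau) for tau in (0.1, 0.5, 0.9)}
```

The grid deliberately mixes all four sign cases treated in the proof, including the two cases in which the residuals have opposite signs.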

For our proof of Theorem 4.6 we need the following simple and well-known lemma (see e.g. [42,
Lemma 7.1]):

Lemma 5.5. For $q \in (1, \infty)$, define $q' \in (1, \infty)$ by $1/q + 1/q' = 1$. Then, for all $a, b \ge 0$, we have
$(qa)^{2/q} (q'b)^{2/q'} \le (a + b)^2$ and $ab \le a^q/q + b^{q'}/q'$.
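Both inequalities of Lemma 5.5 can be spot-checked numerically. The Python sketch below evaluates the two gaps (right-hand side minus left-hand side) on a small grid; the helper names and the grid are assumptions, and a small tolerance absorbs rounding in the cases where equality holds.

```python
def conjugate(q):
    """Conjugate exponent q' with 1/q + 1/q' = 1."""
    return q / (q - 1.0)


def young_gap(a, b, q):
    """a^q / q + b^q' / q' - a * b, which should be nonnegative."""
    qp = conjugate(q)
    return a ** q / q + b ** qp / qp - a * b


def product_gap(a, b, q):
    """(a + b)^2 - (q a)^(2/q) * (q' b)^(2/q'), which should be nonnegative."""
    qp = conjugate(q)
    return (a + b) ** 2 - (q * a) ** (2.0 / q) * (qp * b) ** (2.0 / qp)


values = [0.0, 0.1, 1.0, 3.0]
exponents = [1.5, 2.0, 4.0]
all_ok = all(
    young_gap(a, b, q) >= -1e-12 and product_gap(a, b, q) >= -1e-12
    for a in values for b in values for q in exponents
)
```

For q = 2 and a = b both gaps vanish, which shows that neither inequality can be improved by a constant factor.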

Apart from the semi-norm bounds involving $A_0$, $A_1$, and $A^*$ and some constants, for example, the
constant $n_0$ and the constants on the right side of the oracle inequality, the proof of Theorem 4.6 is almost
identical to the proof of [23, Theorem 3.1]. For this reason, a few parts of the proof will be omitted.

Proof of Theorem \4~6\ Main Decomposition. For / : X —>• R we define hj := L o f — L o /£ p . By 
the definition of fD n ,r, we then have 

T(/d„,t) + ^D n h ?Dn r < T(/ 0 ) + E Dn h f0 + 6, 
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and consequently we obtain 


Y(/d„,t) + ft L,p{fD n ,r) - fc*L,p 
= t (/b„, T )+E p /i^ t 

< T(/o) + E Dn h fo - ^ + E phj Dn T + 4 

= (T(/o) + Epfi/ 0 ) + (Ep n hf 0 - E P h fo ) + (E phj^ r - E Dn hj-^ + S. (66) 

Estimating the First Stochastic Term. Let us first bound the term E D n hf 0 — E phf 0 . To this end, 
we further split this difference into 

Er>„ hf 0 - E P h fo = (E Dn (h fo - hj o ) - E P (h fo - + (E D n hj Q - Eph~). (67) 

Now L o f 0 — L o f 0 > 0 implies hf 0 — hj = L o / 0 — L o f 0 E [0, Bo], and hence we obtain 

E P ((h f0 - hj Q ) - E P (/r /o - hjJ} < E P (h fo - hjJ 2 < B 0 Ep(h fo - hjJ. 

Moreover, we find 


\\hf 0 ~ h To || = || (L o / 0 - L o f* Lp ) -(Lo/ 0 -Lo f* Lp )\\ 

= \\Lofo-LofoW < ||Lo/o|| + ||Lo/ 0 || <2^o- 

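Inequality (15) applied to $h := (h_{f_0} - h_{\wideparen{f}_0}) - \mathbb{E}_P(h_{f_0} - h_{\wideparen{f}_0})$ thus shows that for

\[ n \ge n_0 \ge \max\Bigl\{ \min\Bigl\{ m \ge 3 : m^2 \ge \frac{808c(6A_0 + B_0)}{B_0} \text{ and } \frac{m}{(\log m)^{2/\gamma}} \ge 4 \Bigr\},\; e^{3/b} \Bigr\}, \]

we have

\[ \mathbb{E}_{D_n}(h_{f_0} - h_{\wideparen{f}_0}) - \mathbb{E}_P(h_{f_0} - h_{\wideparen{f}_0}) < \sqrt{\frac{8(\log n)^{2/\gamma}\, \tau B_0\, \mathbb{E}_P(h_{f_0} - h_{\wideparen{f}_0})}{n}} + \frac{8(\log n)^{2/\gamma} B_0 \tau}{3n} \]

with probability $\mu$ not less than $1 - 2e^{-\tau}$. Moreover, using $\sqrt{ab} \le \frac{a}{2} + \frac{b}{2}$, we find

\[ \sqrt{8(\log n)^{2/\gamma} n^{-1} \tau B_0\, \mathbb{E}_P(h_{f_0} - h_{\wideparen{f}_0})} \le \mathbb{E}_P(h_{f_0} - h_{\wideparen{f}_0}) + 2(\log n)^{2/\gamma} n^{-1} B_0 \tau, \]

and consequently we have with probability $\mu$ not less than $1 - 2e^{-\tau}$ that

\[ \mathbb{E}_{D_n}(h_{f_0} - h_{\wideparen{f}_0}) - \mathbb{E}_P(h_{f_0} - h_{\wideparen{f}_0}) < \mathbb{E}_P(h_{f_0} - h_{\wideparen{f}_0}) + \frac{14(\log n)^{2/\gamma} B_0 \tau}{3n}. \tag{68} \]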
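The passage from the Bernstein-type bound to (68) only uses $\sqrt{ab} \le a/2 + b/2$ and the arithmetic $2 + 8/3 = 14/3$. The following Python sketch checks this step on random inputs. It is a numerical illustration under assumptions: the names `bernstein_bound` and `simplified_bound`, the sampled parameter ranges, and the abbreviation `log_factor` for $(\log n)^{2/\gamma}$ are all hypothetical choices made for the example.

```python
import math
import random
from fractions import Fraction


def bernstein_bound(e_p, log_factor, b0, tau, n):
    """sqrt(8 L tau B0 E / n) + 8 L B0 tau / (3 n), with L = (log n)^(2/gamma)
    and E standing for E_P(h_{f0} - h_{clipped f0})."""
    return math.sqrt(8 * log_factor * tau * b0 * e_p / n) + 8 * log_factor * b0 * tau / (3 * n)


def simplified_bound(e_p, log_factor, b0, tau, n):
    """E + 14 L B0 tau / (3 n), the right-hand side of (68)."""
    return e_p + 14 * log_factor * b0 * tau / (3 * n)


# Constant bookkeeping: 2 from sqrt(ab) <= a/2 + b/2, plus 8/3 from the Bernstein term.
constant = Fraction(2) + Fraction(8, 3)

random.seed(0)
violations = 0
for _ in range(10_000):
    b0 = random.uniform(1.0, 50.0)        # assumed range for B_0 >= 1
    e_p = random.uniform(0.0, b0)         # the expectation lies in [0, B_0]
    tau = random.uniform(1.0, 20.0)
    n = random.randint(3, 10**6)
    gamma = random.uniform(0.1, 2.0)      # hypothetical mixing exponent
    log_factor = math.log(n) ** (2.0 / gamma)
    lhs = bernstein_bound(e_p, log_factor, b0, tau, n)
    rhs = simplified_bound(e_p, log_factor, b0, tau, n)
    if lhs > rhs * (1 + 1e-12) + 1e-12:
        violations += 1
```

Finding no violations is consistent with the simplification but does not replace the one-line argument in the text.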

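In order to bound the remaining term in (67), that is $\mathbb{E}_{D_n} h_{\wideparen{f}_0} - \mathbb{E}_P h_{\wideparen{f}_0}$, we first observe that the assumed $L(x,y,t) \le 1$ for all $(x,y) \in X \times Y$ and $t \in [-M, M]$ implies $\|h_{\wideparen{f}_0}\|_\infty \le 1$, and hence we have $\|h_{\wideparen{f}_0} - \mathbb{E}_P h_{\wideparen{f}_0}\|_\infty \le 2$. Furthermore, we have

\[ \|h_{\wideparen{f}_0}\| = \|L \circ \wideparen{f}_0 - L \circ f^*_{L,P}\| \le \|L \circ \wideparen{f}_0\| + \|L \circ f^*_{L,P}\| \le A_0 + A^*. \]

Moreover, (31) yields

\[ \mathbb{E}_P(h_{\wideparen{f}_0} - \mathbb{E}_P h_{\wideparen{f}_0})^2 \le \mathbb{E}_P h_{\wideparen{f}_0}^2 \le V (\mathbb{E}_P h_{\wideparen{f}_0})^{\vartheta}. \]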

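In addition, if $\vartheta \in (0,1]$, the second inequality in Lemma 5.5 implies for $q := \frac{2}{2-\vartheta}$, $q' := \frac{2}{\vartheta}$, $a := \bigl((\log n)^{2/\gamma} n^{-1} 2^{3-\vartheta} \vartheta^{\vartheta} V \tau\bigr)^{1/2}$, and $b := \bigl(2\vartheta^{-1} \mathbb{E}_P h_{\wideparen{f}_0}\bigr)^{\vartheta/2}$, that

\begin{align*}
\sqrt{\frac{8(\log n)^{2/\gamma} V \tau (\mathbb{E}_P h_{\wideparen{f}_0})^{\vartheta}}{n}}
&\le \Bigl(1 - \frac{\vartheta}{2}\Bigr) \Bigl(\frac{2^{3-\vartheta} \vartheta^{\vartheta} V (\log n)^{2/\gamma} \tau}{n}\Bigr)^{\frac{1}{2-\vartheta}} + \mathbb{E}_P h_{\wideparen{f}_0} \\
&\le \Bigl(\frac{8(\log n)^{2/\gamma} V \tau}{n}\Bigr)^{\frac{1}{2-\vartheta}} + \mathbb{E}_P h_{\wideparen{f}_0}.
\end{align*}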


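Since $\mathbb{E}_P h_{\wideparen{f}_0} \ge 0$, this inequality also holds for $\vartheta = 0$, and hence (15) shows that for

\[ n \ge n_0 \ge \max\Bigl\{ \min\Bigl\{ m \ge 3 : m^2 \ge \frac{808c(3A_0 + 3A^* + 2)}{2} \text{ and } \frac{m}{(\log m)^{2/\gamma}} \ge 4 \Bigr\},\; e^{3/b} \Bigr\}, \]

we have

\[ \mathbb{E}_{D_n} h_{\wideparen{f}_0} - \mathbb{E}_P h_{\wideparen{f}_0} < \mathbb{E}_P h_{\wideparen{f}_0} + \Bigl(\frac{8(\log n)^{2/\gamma} V \tau}{n}\Bigr)^{\frac{1}{2-\vartheta}} + \frac{16(\log n)^{2/\gamma} \tau}{3n} \tag{69} \]

with probability $\mu$ not less than $1 - 2e^{-\tau}$. By combining this estimate with (68) and (67), we now obtain that with probability $\mu$ not less than $1 - 4e^{-\tau}$ we have

\[ \mathbb{E}_{D_n} h_{f_0} - \mathbb{E}_P h_{f_0} < \mathbb{E}_P h_{f_0} + \Bigl(\frac{8(\log n)^{2/\gamma} V \tau}{n}\Bigr)^{\frac{1}{2-\vartheta}} + \frac{16(\log n)^{2/\gamma} \tau}{3n} + \frac{14(\log n)^{2/\gamma} B_0 \tau}{3n}, \tag{70} \]

since $1 \le B_0$, i.e., we have established a bound on the second term in (66).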
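The application of Lemma 5.5 in this step, with $q = 2/(2-\vartheta)$ and $q' = 2/\vartheta$, can be checked numerically: the product $ab$ should equal the square-root term, and $a^q/q + b^{q'}/q'$ should equal the first upper bound. The Python sketch below does this for a few hypothetical parameter values. The function name `lemma_terms` and the argument `log_factor` (standing for $(\log n)^{2/\gamma}$) are illustrative assumptions.

```python
import math


def lemma_terms(theta, V, tau, n, log_factor, e_p):
    """Return (a*b, sqrt-term, a^q/q + b^q'/q', closed form, final bound)
    for theta in (0, 1] and e_p = E_P h >= 0."""
    q = 2.0 / (2.0 - theta)
    q_conj = 2.0 / theta
    x = log_factor / n * 2 ** (3 - theta) * theta ** theta * V * tau
    a = math.sqrt(x)
    b = (2.0 * e_p / theta) ** (theta / 2.0)
    product = a * b
    sqrt_term = math.sqrt(8 * log_factor * V * tau * e_p ** theta / n)
    young_rhs = a ** q / q + b ** q_conj / q_conj
    closed_form = (1 - theta / 2) * x ** (1 / (2 - theta)) + e_p
    final_bound = (8 * log_factor * V * tau / n) ** (1 / (2 - theta)) + e_p
    return product, sqrt_term, young_rhs, closed_form, final_bound


# Hypothetical parameter combinations used only for illustration.
checks = [
    lemma_terms(theta, V, tau, n, log_factor, e_p)
    for theta in (0.25, 0.5, 1.0)
    for V in (1.0, 4.0)
    for tau in (1.0, 3.0)
    for n, log_factor in ((100, 20.0), (10_000, 80.0))
    for e_p in (0.01, 0.3, 1.0)
]
```

Each tuple in `checks` should satisfy `product == sqrt_term`, `young_rhs == closed_form`, and `sqrt_term <= closed_form <= final_bound`, up to floating-point rounding.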

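Estimating the Second Stochastic Term. For the third term in (66), let us first consider the case $n/(\log n)^{2/\gamma} < 8(\tau + \varphi(\varepsilon/2) 2^p r^p)$. Combining (70) with (66) and using $1 \le B_0$, $1 \le V$, and $\mathbb{E}_P h_{\wideparen{f}_{D_n,\Upsilon}} - \mathbb{E}_{D_n} h_{\wideparen{f}_{D_n,\Upsilon}} \le 2$, we find

\begin{align*}
&\Upsilon(f_{D_n,\Upsilon}) + \mathcal{R}_{L,P}(\wideparen{f}_{D_n,\Upsilon}) - \mathcal{R}^*_{L,P} \\
&\quad \le \Upsilon(f_0) + 2\mathbb{E}_P h_{f_0} + \Bigl(\frac{8(\log n)^{2/\gamma} V \tau}{n}\Bigr)^{\frac{1}{2-\vartheta}} + \frac{16(\log n)^{2/\gamma} \tau}{3n} + \frac{14(\log n)^{2/\gamma} B_0 \tau}{3n} \\
&\qquad + \bigl(\mathbb{E}_P h_{\wideparen{f}_{D_n,\Upsilon}} - \mathbb{E}_{D_n} h_{\wideparen{f}_{D_n,\Upsilon}}\bigr) + \delta \\
&\quad \le \Upsilon(f_0) + 2\mathbb{E}_P h_{f_0} + \Bigl(\frac{8(\log n)^{2/\gamma} V (\tau + \varphi(\varepsilon/2) 2^p r^p)}{n}\Bigr)^{\frac{1}{2-\vartheta}} + \frac{10(\log n)^{2/\gamma} B_0 \tau}{n} \\
&\qquad + 2\Bigl(\frac{8(\log n)^{2/\gamma} V (\tau + \varphi(\varepsilon/2) 2^p r^p)}{n}\Bigr)^{\frac{1}{2-\vartheta}} + \delta \\
&\quad \le 2\Upsilon(f_0) + 4\mathbb{E}_P h_{f_0} + 3\Bigl(\frac{24(\log n)^{2/\gamma} V (\tau + \varphi(\varepsilon/2) 2^p r^p)}{3n}\Bigr)^{\frac{1}{2-\vartheta}} + \frac{10(\log n)^{2/\gamma} B_0 \tau}{n} + 2\delta \\
&\quad \le 2\Upsilon(f_0) + 4\mathbb{E}_P h_{f_0} + 4r + 2\delta
\end{align*}

with probability $\mu$ not less than $1 - 4e^{-\tau}$. It thus remains to consider the case $n/(\log n)^{2/\gamma} \ge 8(\tau + \varphi(\varepsilon/2) 2^p r^p)$.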

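Introduction of the Quotients. To establish a non-trivial bound on the term $\mathbb{E}_P h_{\wideparen{f}_{D_n,\Upsilon}} - \mathbb{E}_{D_n} h_{\wideparen{f}_{D_n,\Upsilon}}$ in (66), we define functions

\[ g_{f,r} := \frac{\mathbb{E}_P h_{\wideparen{f}} - h_{\wideparen{f}}}{\Upsilon(f) + \mathbb{E}_P h_{\wideparen{f}} + r}, \qquad f \in \mathcal{F},\ r > r^*. \]

For $f \in \mathcal{F}$, we have $\|\mathbb{E}_P h_{\wideparen{f}} - h_{\wideparen{f}}\|_\infty \le 2$. Furthermore, for $f \in \mathcal{F}_r$ and $k \ge 0$ with $2^k r \le 1$, by the assumption (29) we find

\[ \|h_{\wideparen{f}}\| = \|L \circ \wideparen{f} - L \circ f^*_{L,P}\| \le \|L \circ \wideparen{f}\| + \|L \circ f^*_{L,P}\| \le A_{2^k r} + A^* \le A_1 + A^*. \]

Moreover, for $f \in \mathcal{F}_r$, the variance bound (31) implies

\[ \mathbb{E}_P(h_{\wideparen{f}} - \mathbb{E}_P h_{\wideparen{f}})^2 \le \mathbb{E}_P h_{\wideparen{f}}^2 \le V (\mathbb{E}_P h_{\wideparen{f}})^{\vartheta} \le V r^{\vartheta}. \tag{71} \]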

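Peeling. This part is completely identical to the part Peeling on page 135 of our work [23]. Hence we have omitted some steps of the derivations. In case of uncertainty one may refer to [23] for details.

For a fixed $r \in (r^*, 1]$, let $K$ be the largest integer satisfying $2^K r \le 1$. Then we can get the following disjoint partition of the function set $\mathcal{F}_1$:

\[ \mathcal{F}_1 \subset \mathcal{F}_r \cup \bigcup_{k=1}^{K+1} \bigl(\mathcal{F}_{2^k r} \setminus \mathcal{F}_{2^{k-1} r}\bigr). \tag{72} \]

We further write $\mathcal{C}_{\varepsilon,r,0}$ for a minimal $\varepsilon$-net of $\mathcal{F}_r$ and $\mathcal{C}_{\varepsilon,r,k}$ for minimal $\varepsilon$-nets of $\mathcal{F}_{2^k r} \setminus \mathcal{F}_{2^{k-1} r}$, $1 \le k \le K+1$, respectively. Then the union of these nets $\mathcal{C}_{\varepsilon,1} := \bigcup_{k=0}^{K+1} \mathcal{C}_{\varepsilon,r,k}$ is an $\varepsilon$-net of the set $\mathcal{F}_1$. Moreover, we define

\[ \tilde{\mathcal{C}}_{\varepsilon,r,k} := \bigcup_{l=0}^{k} \mathcal{C}_{\varepsilon,r,l}, \qquad 0 \le k \le K+1. \tag{73} \]

Then we have $\mathcal{C}_{\varepsilon,1} = \bigcup_{k=0}^{K+1} \tilde{\mathcal{C}}_{\varepsilon,r,k}$. Moreover, the cardinality of $\tilde{\mathcal{C}}_{\varepsilon,r,k}$ can be estimated by

\[ |\tilde{\mathcal{C}}_{\varepsilon,r,k}| \le (k+1) \exp\bigl(\varphi(\varepsilon/2) 2^{kp} r^p\bigr), \qquad 0 \le k \le K+1. \tag{74} \]

Then, peeling by [23, Theorem 5.2] implies

\[ \mu\Bigl( \sup_{f \in \mathcal{C}_{\varepsilon,1}} \mathbb{E}_{D_n} g_{f,r} > \frac{1}{4} \Bigr) \le 2 \sum_{k=1}^{K+1} \mu\Bigl( \sup_{f \in \tilde{\mathcal{C}}_{\varepsilon,r,k}} \mathbb{E}_{D_n}(\mathbb{E}_P h_{\wideparen{f}} - h_{\wideparen{f}}) > 2^{k-3} r \Bigr). \tag{75} \]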

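Estimating the Error Probabilities on the "Spheres". Our next goal is to estimate all the error probabilities on the right-hand side of (75). By our construction, we have $\tilde{\mathcal{C}}_{\varepsilon,r,k} \subset \mathcal{F}_{2^k r}$. This, together with (14), (71), the union bound and the estimates of the covering numbers (74), implies that for

\[ n \ge n_0 \ge \max\Bigl\{ \min\Bigl\{ m \ge 3 : m^2 \ge \frac{808c(3A_1 + 3A^* + 2)}{2} \text{ and } \frac{m}{(\log m)^{2/\gamma}} \ge 4 \Bigr\},\; e^{3/b} \Bigr\}, \]

we have

\begin{align*}
&\mu\Bigl( \sup_{f \in \tilde{\mathcal{C}}_{\varepsilon,r,k}} \mathbb{E}_{D_n}(\mathbb{E}_P h_{\wideparen{f}} - h_{\wideparen{f}}) > 2^{k-3} r \Bigr) \\
&\quad \le 2 |\tilde{\mathcal{C}}_{\varepsilon,r,k}| \exp\Bigl( -\frac{n}{8(\log n)^{2/\gamma}} \cdot \frac{(2^{k-3} r)^2}{V (2^k r)^{\vartheta} + 2(2^{k-3} r)/3} \Bigr) \\
&\quad \le 2(k+1) \exp\bigl(\varphi(\varepsilon/2) 2^{kp} r^p\bigr) \cdot \exp\Bigl( -\frac{n}{8(\log n)^{2/\gamma}} \cdot \frac{3(2^{k-1} r)^2}{96 V (2^{k-1} r)^{\vartheta} + 8(2^{k-1} r)} \Bigr) \tag{76}
\end{align*}

since $\vartheta \in [0,1]$. For $k \ge 1$, we denote the right-hand side of this estimate by $p_k(r)$, that is

\[ p_k(r) := 2(k+1) \exp\bigl(\varphi(\varepsilon/2) 2^{kp} r^p\bigr) \cdot \exp\Bigl( -\frac{n}{8(\log n)^{2/\gamma}} \cdot \frac{3(2^{k-1} r)^2}{96 V (2^{k-1} r)^{\vartheta} + 8(2^{k-1} r)} \Bigr). \tag{77} \]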


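Then, as derived in [23], we can obtain

\[ q_k(r) := \frac{p_{k+1}(r)}{p_k(r)} \le 2 \exp\bigl(\varphi(\varepsilon/2) 2^{kp+1} r^p\bigr) \cdot \exp\Bigl( -\frac{n}{8(\log n)^{2/\gamma}} \cdot \frac{3(2^{k-1} r)^2}{96 V (2^{k-1} r)^{\vartheta} + 8(2^{k-1} r)} \Bigr), \]

and our assumption $2^k r \le 1$, $0 \le k \le K$, implies

\begin{align*}
q_k(r) &\le 2 \exp\bigl(\varphi(\varepsilon/2) 2^{kp+1} r^p\bigr) \cdot \exp\Bigl( -\frac{n}{8(\log n)^{2/\gamma}} \cdot \frac{3(2^{k-1} r)^2}{96 V (2^{k-1} r)^{\vartheta} + 8(2^{k-1} r)} \Bigr) \\
&\le 2 \exp\Bigl( 2^{(k-1)p} \cdot 4 r^p \varphi(\varepsilon/2) - 2^{(k-1)(2-\vartheta)} \cdot \frac{3 n r^{2-\vartheta}}{64(12V + 1)(\log n)^{2/\gamma}} \Bigr).
\end{align*}

Since $p \in (0,1]$, $k \ge 1$ and $\vartheta \in [0,1]$, we have $2^{(k-1)p} \le 2^{(k-1)(2-\vartheta)}$. Then the first assumption in (33), namely,

\[ r \ge \Bigl( \frac{512(12V + 1)(\log n)^{2/\gamma} (\tau + \varphi(\varepsilon/2) 2^p r^p)}{3n} \Bigr)^{\frac{1}{2-\vartheta}}, \]

implies that $3 n r^{2-\vartheta} \ge 512(12V + 1)(\log n)^{2/\gamma} \varphi(\varepsilon/2) r^p$. By using $2^{(k-1)(2-\vartheta)} \ge 1$, we find

\[ q_k(r) \le 2 \exp\Bigl( -\frac{3 n r^{2-\vartheta}}{128(12V + 1)(\log n)^{2/\gamma}} \Bigr). \]

Moreover, since $\tau \ge 1$, the first assumption in (33) implies also $3 n r^{2-\vartheta} \ge 4 \cdot 128(12V + 1)(\log n)^{2/\gamma}$. Hence we have $q_k(r) \le 2e^{-4}$, that is,

\[ p_{k+1}(r) \le 2e^{-4} p_k(r) \qquad \text{for all } k \ge 1. \tag{78} \]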

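Summing all the Error Probabilities. Now, combining (75) with (76), (77), and (78), we obtain

\begin{align*}
\mu\Bigl( \sup_{f \in \mathcal{C}_{\varepsilon,1}} \mathbb{E}_{D_n} g_{f,r} > \frac{1}{4} \Bigr)
&\le 2 \sum_{k=1}^{K+1} p_k(r) \le 3 p_1(r) \\
&= 12 \exp\bigl(\varphi(\varepsilon/2) 2^p r^p\bigr) \cdot \exp\Bigl( -\frac{n}{8(\log n)^{2/\gamma}} \cdot \frac{3r^2}{96 V r^{\vartheta} + 8r} \Bigr) \\
&\le 12 \exp\bigl(\varphi(\varepsilon/2) 2^p r^p\bigr) \cdot \exp\Bigl( -\frac{3 n r^{2-\vartheta}}{64(12V + 1)(\log n)^{2/\gamma}} \Bigr),
\end{align*}

where in the last step we used $r \in (0,1]$ and $\vartheta \in [0,1]$. Then once again the first assumption in (33) gives $3 n r^{2-\vartheta} \ge 64(12V + 1)(\log n)^{2/\gamma} (\tau + \varphi(\varepsilon/2) 2^p r^p)$ and a simple transformation thus yields

\[ \mu\Bigl( D_n \in (X \times Y)^n : \sup_{f \in \mathcal{C}_{\varepsilon,1}} \mathbb{E}_{D_n} g_{f,r} \le \frac{1}{4} \Bigr) \ge 1 - 12e^{-\tau}. \]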
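The summation step relies on the geometric decay (78): if $p_{k+1}(r) \le 2e^{-4} p_k(r)$ for all $k \ge 1$, then $\sum_k p_k(r) \le p_1(r)/(1 - 2e^{-4})$, so even twice the sum stays below $3p_1(r)$. The short Python sketch below evaluates these constants. The names `ratio`, `geometric_sum_factor`, and `infinite_factor` are hypothetical, and the factor 2 in front of the sum is taken as read from the display above.

```python
import math

# Decay ratio from (78): p_{k+1}(r) <= ratio * p_k(r).
ratio = 2.0 * math.exp(-4.0)


def geometric_sum_factor(num_terms: int) -> float:
    """Upper bound on (p_1 + ... + p_num_terms) / p_1 implied by the decay ratio."""
    return sum(ratio ** k for k in range(num_terms))


# Bound valid for any number of terms K + 1.
infinite_factor = 1.0 / (1.0 - ratio)

# Factor multiplying p_1(r) when the sum carries a leading factor 2.
doubled_factor = 2.0 * infinite_factor
```

Here `doubled_factor` is roughly 2.08, which leaves a comfortable margin below the constant 3 used in the proof.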


The rest of the argument is completely analogous to the proof of [23, Theorem 3.1] and the assertion is proved. □

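Proof of Theorem 4.8. For the least-squares loss, the variance bound (31) is valid with $\vartheta = 1$, hence the condition (33) is satisfied if

\[ r \ge \max\Bigl\{ (c_V 2^{1+3p} a)^{\frac{1}{1-p}}\, \sigma^{-\frac{d}{1-p}} \lambda^{-\frac{p}{1-p}} n^{-\frac{1}{1-p}} (\log n)^{\frac{2}{\gamma(1-p)}} \varepsilon^{-\frac{2p}{1-p}},\; \frac{2c_V (\log n)^{2/\gamma} \tau}{n},\; \frac{20 B_0 (\log n)^{2/\gamma} \tau}{n} \Bigr\}. \tag{79} \]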


Furthermore, [21, Section 2] shows that there exists a constant $c > 0$ such that for all $\sigma \in (0,1]$, there is
an / 0 € H a with ||/ 0 ||oo < c, \\fo\\ 2 H 


< C, ll/oll< CO d , and 


n L ,p(f 0 )-ni P <co 2t . 

Moreover, [41, Lemma 5.5] shows every function $f$ in $H_\sigma$ is Lipschitz continuous with

\[ |f|_1 \le \sqrt{2}\, \sigma^{-1} \|f\|_{H_\sigma(X)}, \]

and this implies

\[ |\wideparen{f}_0|_1 \le |f_0|_1 \le \sqrt{2}\, \sigma^{-1} \|f_0\|_{H_\sigma(X)} \le \sqrt{2}\, c\, \sigma^{-1}. \]

Moreover, there exists a constant $C^* < \infty$ such that $|f^*_{L,P}|_1 \le C^*$, since we have assumed that $f^*_{L,P} \in \mathrm{Lip}(\mathbb{R}^d)$. Then, Lemma 4.7 (i) yields


4j4q + Ai + A* + 1 — 2\/2 (M + ||/||oo) ( 4 + 4|/o|i + 1 + sup |/|i + 1 + \fL,p\i + 1 j + 1 


< 2v/2(M+ H/lloo) ( 7 + 4\/2c<7 1 +supV^<r *A 1 / 2 r 1 / 2 + (7* ) + 1 

V r<l 

= 2^2 (M + H/lloo) (7 + h^ar^A" 1 / 2 + C*) + 1 

< 2C y (T _ 1 A~ 1/2 < 2Cn 

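for all $\sigma, \lambda \in (0,1]$ with $\lambda\sigma^2 \ge n^{-2}$, where $C$ is a constant independent of $n$, $\lambda$, and $\sigma$. For

\[ n \ge \max\Bigl\{ 2C,\; \min\Bigl\{ m \ge 3 : \frac{m}{(\log m)^{2/\gamma}} \ge 4 \Bigr\},\; e^{3/b} \Bigr\}, \]

the oracle inequality (34) thus implies

\begin{align*}
&\lambda \|f_{D_n,\lambda}\|^2_{H_\sigma} + \mathcal{R}_{L,P}(\wideparen{f}_{D_n,\lambda}) - \mathcal{R}^*_{L,P} \\
&\quad \le 4\lambda \|f_0\|^2_{H_\sigma} + 4\mathcal{R}_{L,P}(f_0) - 4\mathcal{R}^*_{L,P} + 4r + 5\varepsilon \\
&\quad \le C_1 \Bigl( \lambda\sigma^{-d} + \sigma^{2t} + \sigma^{-\frac{d}{1-p}} \lambda^{-\frac{p}{1-p}} n^{-\frac{1}{1-p}} (\log n)^{\frac{2}{\gamma(1-p)}} \varepsilon^{-\frac{2p}{1-p}} \tau + \varepsilon \Bigr),
\end{align*}

where $C_1$ is a constant independent of $n$, $\lambda$, $\sigma$, $\tau$, and $\varepsilon$. Here, $\lambda$, $\sigma$, and $n$ need to satisfy the additional requirement $\sigma, \lambda \in (0,1]$ with $\lambda\sigma^2 \ge n^{-2}$. Now, optimizing over $\varepsilon$ by using [42, Lemma A.1.5], we get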


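\[ \lambda \|f_{D_n,\lambda}\|^2_{H_\sigma} + \mathcal{R}_{L,P}(\wideparen{f}_{D_n,\lambda}) - \mathcal{R}^*_{L,P} \le C_2 \Bigl( \lambda\sigma^{-d} + \sigma^{2t} + \sigma^{-\frac{d}{1+p}} \lambda^{-\frac{p}{1+p}} n^{-\frac{1}{1+p}} (\log n)^{\frac{2}{\gamma(1+p)}} \Bigr), \tag{80} \]

where $C_2$ is a constant independent of $n$, $\lambda$, and $\sigma$. By applying [42, Lemma A.1.6], we can optimize the right-hand side of (80) over $\lambda$ and $\sigma$. We then see that for all $\xi > 0$ we can find $p, \zeta \in (0,1)$ sufficiently close to 0 such that the LS-SVM using the Gaussian RKHS $H_\sigma$ and $\lambda_n = n^{-1}$, $\sigma_n = n^{-\frac{1}{2t+d}}$ learns with rate $n^{-\frac{2t}{2t+d}+\xi}$, since the requirement $\lambda_n \sigma_n^2 \ge n^{-2}$ is automatically satisfied by the assumed $t \ge 1$. □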
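The exponent bookkeeping behind the parameter choice $\lambda_n = n^{-1}$, $\sigma_n = n^{-1/(2t+d)}$ can be verified with exact rational arithmetic. The Python sketch below computes the exponents of $n$ in $\lambda_n\sigma_n^{-d}$, $\sigma_n^{2t}$, and $\lambda_n\sigma_n^2$. The function name `rate_exponents` and the sample values of $t$ and $d$ are assumptions chosen for illustration.

```python
from fractions import Fraction


def rate_exponents(t, d):
    """Exponents of n for lambda*sigma^(-d), sigma^(2t), and lambda*sigma^2
    under lambda_n = n^(-1) and sigma_n = n^(-1/(2t+d))."""
    t = Fraction(t)
    d = Fraction(d)
    lambda_exp = Fraction(-1)
    sigma_exp = -1 / (2 * t + d)
    approximation = lambda_exp - d * sigma_exp   # exponent of lambda * sigma^(-d)
    bias = 2 * t * sigma_exp                     # exponent of sigma^(2t)
    constraint = lambda_exp + 2 * sigma_exp      # exponent of lambda * sigma^2
    return approximation, bias, constraint


# Hypothetical smoothness t >= 1 and dimension d.
approximation, bias, constraint = rate_exponents(t=2, d=3)
target = -Fraction(2 * 2, 2 * 2 + 3)             # -2t/(2t+d)
```

Both `approximation` and `bias` equal $-2t/(2t+d)$, and `constraint` is at least $-2$ whenever $2t + d \ge 2$, which matches the remark that $t \ge 1$ suffices.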

Proof of Theorem 4.10. Theorem 4.6 yields

\[ \Upsilon(f_{D_n,\Upsilon}) + \mathbb{E}_P h_{\wideparen{f}_{D_n,\Upsilon}} \le 2\Upsilon(f_0) + 4\mathbb{E}_P h_{f_0} + 4r + 5\varepsilon + 2\delta \]

with probability not less than $1 - 16e^{-\tau}$. Using (43) and the definition (42) we then easily obtain the assertion. □